
Conversation

@weefuzzy
Member

As discussed: between folk with new M1 machines and folk with older machines, having AVX on by default creates problems out of proportion to the benefits. This change uses SSE by default on x64/x86 instead, and people can still set their own preferred options using FLUID_ARCH.

@jamesb93 There are SIMD options for ARM by default, but TBH I don't know how sensible they are. Are you using these on your M1 builds?

@weefuzzy weefuzzy self-assigned this Sep 15, 2021
@jamesb93
Member

@weefuzzy Would those defaults be activated by passing my own flags to FLUID_ARCH? I am not explicitly modifying any other parts of the build process apart from the OSX_ARCHITECTURES flag. Is there a way to verify whether I was using said SIMD ops?

@weefuzzy
Member Author

Would those defaults be activated by passing my own flags to FLUID_ARCH?

They would only activate if nothing was passed in FLUID_ARCH. So, if you do a native (i.e. ARM) build on your machine and don't supply anything further, I'd expect them to be on.

Is there a way to verify if I was using said SIMD ops?

If you're building with Ninja or Make, you can get CMake to produce a compile commands JSON file, then grep for the relevant flags. In Xcode, you can get to the raw compile commands through the report navigator (speech bubble icon).
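For example, something along these lines (just a sketch; `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON` is the key bit, and the stand-in JSON line here is only to demonstrate the grep):

```shell
# In a real build dir you'd first reconfigure with the compile database enabled:
#   cmake -GNinja -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..
# which writes compile_commands.json. Then grep it for -m* arch/SIMD flags.
# Demo against a stand-in compile_commands.json entry:
printf '%s\n' '{"command": "clang -march=armv7-a -mfpu=neon -c a.c"}' > compile_commands.json
grep -oE '[-]m[a-z0-9=.-]+' compile_commands.json | sort -u
```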

@jamesb93 jamesb93 added the enhancement New feature or request label Sep 15, 2021
@jamesb93
Member

What are the relevant flags I should grep for?

@weefuzzy
Member Author

-march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon

@jamesb93
Member

I couldn't find anything.

compile_commands.json.zip

@weefuzzy
Member Author

Those compile commands are all from x64 builds. You'd probably need to delete your CMake cache, or the whole build tree, to reset to doing ARM or both. Worryingly, though, I didn't see either AVX or SSE in there (perhaps your FLUID_ARCH is already in the cache as being blank?).
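i.e., something like this from the source root (assuming an out-of-source build directory named `build`):

```shell
# Wipe the build tree so cached settings (FLUID_ARCH, compiler checks, etc.)
# are forgotten, then start a fresh build dir to reconfigure in.
rm -rf build
mkdir build && cd build
# ...then re-run cmake with whatever flags you're testing
```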

@jamesb93
Member

It will take a mo ⏲️ but I will iterate through all the possible combos of flags cleaning between each :)

@jamesb93
Member

cmake -GNinja -DMAX_SDK_PATH=../sdk -DFLUID_ARCH=-mcpu=native -DCMAKE_OSX_ARCHITECTURES="x86_64" ..

This does not work if I don't pass FLUID_ARCH. It whinges about not being able to build for armv7.

@weefuzzy
Member Author

Presumably because it's automatically downloading HISSTools. What happens with

cmake -GNinja -DMAX_SDK_PATH=../sdk -DHISS_PATH=<Some local and up to date HISSTools clone> ..

?

@jamesb93
Member

This fails:

cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES="x86_64" .. for the same reason

but this works:

cmake -GNinja -DHISS_PATH=../../hiss -DFLUID_ARCH=-mcpu=native -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES="x86_64" ..

@weefuzzy
Member Author

What's the reason though? (and I'm confused about the -DCMAKE_OSX_ARCHITECTURES="x86_64", I thought you were trying an ARM build, but I'm getting lost).

If the HISSTools in ../../hiss is up to date then presumably the hand rolled SIMD stuff is happy, so is it complaining about compiler flags vis-a-vis architecture? What's the error message?

@jamesb93
Member

I guess I'm a little lost too 🪘

cmake -GNinja -DHISS_PATH=../../hiss -DFLUID_ARCH=-mcpu=native -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES="arm64" .. makes a working binary using the new beta SDK.

Perhaps you could draw up a list of combinations that you want me to run, because I'm getting bogged down in dependencies + flags and what is meaningful in helping push forward this side of things.

@weefuzzy
Member Author

Ok. One ambition is to make sure FLUID_ARCH is still optional, and then to make sure that on an M1 we can build just ARM, just x64, and produce a fat boi without having to do too much acrobatics.

So, in a nice world, all of the following would work:


cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk  ..

I would expect to produce either an arm or a universal binary


cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES="arm64"  ..

I would expect just arm64


cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES="x86_64"  ..

I would expect just x64


If any of them fail, post the error messages and we'll take it from there

@jamesb93
Member

And just to clarify, which SDK would this point to in each example? AFAIK you cannot make a fat binary with the old SDK.

@weefuzzy
Member Author

8.2 in all cases. I'm happy enough that we can build < 8.2 on x64

@jamesb93
Member

error1.txt
error2.txt
error3.txt

None of these work, and I think it's because of the lack of FLUID_ARCH. The attachments correspond to the commands you wrote out.

@weefuzzy
Member Author

Only indirectly. For the first two, it doesn't like the flags that I plonked in (for R-Pi, maybe). For the third, it's still picking up ARM in the platform test, so we'll need to find a better platform test:

  1. clang: error: the clang compiler does not support '-march=armv7-a'
  2. clang: error: the clang compiler does not support '-march=armv7-a'
  3. error: unknown target CPU 'armv7-a'
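For reference, a better platform test might look something like this (a sketch only, not the actual fix; the variable names here are hypothetical):

```cmake
# Sketch: on Apple, trust CMAKE_OSX_ARCHITECTURES (if set) rather than
# the host processor, which can be misleading.
if(APPLE AND CMAKE_OSX_ARCHITECTURES MATCHES "arm64")
  # AArch64 clang has NEON on by default, and rejects -march=armv7-a
  set(FLUID_ARCH_DEFAULT "")
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "amd64|x86_64|AMD64|i686|i386|x86")
  # SSE rather than AVX by default, per this PR
  set(FLUID_ARCH_DEFAULT "-msse -msse2")
else()
  set(FLUID_ARCH_DEFAULT "") # unknown platform: add nothing
endif()

if(NOT FLUID_ARCH)
  set(FLUID_ARCH "${FLUID_ARCH_DEFAULT}")
endif()
```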

@weefuzzy
Member Author

Anyway, thanks for looking at this. I'll borrow my beloved's m1 in due course and have a crack at fixing all this properly

@jamesb93
Member

Feel free to bug me with more things to try, and sorry for the confusion somewhere in the middle there 😭

@AlexHarker
Contributor

In case it's helpful (and not obvious) I believe that arm64 is the correct thing to pass for ARM M1 builds (that is what is happening in the recent work I did on M1ifying some externals through c74)

@weefuzzy
Member Author

Cheers. Yes, it seemed a bit much to expect whatever I'd pasted from somewhere else for R-Pi or Bela or something to Just Work 😆 Having skimmed the CMake issues list, it looks like there are possible additional complications depending on whether one is running the Intel or ARM build of CMake on Apple Silicon, and relying on the processor type isn't going to be useful for managing this, so we'll maybe need a different set of conditions for Apple.

Now that I have access to a Shiny Thing, I can mash the keyboard until it works.

@AlexHarker
Contributor

The machine you are compiling on is irrelevant. The only relevant thing is your version of Xcode, and older versions simply ignore the arm64 architecture flag. Given that 32-bit Max is no longer a thing, that produces acceptable results in all circumstances for the versions of Max you are targeting, no?

@AlexHarker
Contributor

(assuming you are passing both flags and making a UB2)

@weefuzzy
Member Author

Thanks – Xcode and CMake don't behave the same here (from what I gathered earlier); the issue isn't the machine as such: different, and not wholly predictable, host processors were being reported depending on whether CMake was running under Rosetta or not. It's also not clear that one will get a UB2 by default without manually setting the flags, the upshot being that I need more intelligent probing CMake code than I currently have.

@AlexHarker
Contributor

Well, my suggestion is that you manually set the flags in all cases to x86_64 arm64 (a UB2). What is the case that you don't think this will satisfy? In other words, I am unclear why you need to know the processor of the machine you are building on. The default for macOS debug builds is native architecture only, so if you want to avoid building both in that circumstance you need to target configurations, but you still wouldn't need to know the architecture of the build machine.
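In CMake terms that could be as simple as (a sketch; this needs to be set before the first `project()` call, or passed on the command line):

```cmake
# Always target both slices, regardless of the build machine
set(CMAKE_OSX_ARCHITECTURES "x86_64;arm64" CACHE STRING "Target a UB2")
```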

@weefuzzy
Member Author

They came from somewhere. Maybe it was Bela, not R-Pi 🤷 Anyway, the happy news seems to be that I might be able to get rid of some complexity in the build by ditching this whole thing.

@weefuzzy
Member Author

@jamesb93 I've now made it so that the silly things shouldn't happen on arm64, so it should work with nothing in FLUID_ARCH for both x64 and arm64. AFAICT, CMake won't set CMAKE_OSX_ARCHITECTURES to produce a UB2 by default, so if you want a fatty you'll need to pass that yourself.

I have high hopes for:

cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES="arm64;x86_64"  ..

@jamesb93
Member

jamesb93 commented Sep 16, 2021

I get this error atm. I've looked at the .cmake and it looks sane to me...

CMake Error at /Users/james/dev/flucoma/core/script/flucoma_simdcmd.cmake:27 (endif):
  Flow control statements are not properly nested.
Call Stack (most recent call first):
  /Users/james/dev/flucoma/core/CMakeLists.txt:184 (include)

EDIT:

here was my build command:

cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk -DFLUID_PATH=../../core -DCMAKE_OSX_ARCHITECTURES="x86_64;arm64" ..

@weefuzzy
Member Author

Whoops. Extra endif(). Try now?

@jamesb93
Member

jamesb93 commented Sep 16, 2021

I can build an intel 64 bit binary that works under Rosetta:

cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../old-sdk -DFLUID_PATH=../../core -DCMAKE_OSX_ARCHITECTURES="x86_64" ..

This is with the 8.0.3 Max SDK. Building a fat binary with the 8.0.3 SDK doesn't work, and when I point to the beta 8.2 I get this error:

Generating: fluid.stftpass_tilde
CMake Error at source/script/max-pretarget.cmake:72 (file):
  file STRINGS file
  "/Users/james/dev/flucoma/max/sdk/source/c74support/max-includes/c74_linker_flags.txt"
  cannot be read.
Call Stack (most recent call first):
  source/projects/fluid.stftpass_tilde/CMakeLists.txt:11 (include)

with the following command:

cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk -DFLUID_PATH=../../core -DCMAKE_OSX_ARCHITECTURES="x86_64;arm64" ..

So I assume it's not accounting for the change in structure of the SDK (yet?).

@weefuzzy
Member Author

Ok. I would expect this to work in conjunction with flucoma/flucoma-max#32 because that deals with the SDK layout changes.

@jamesb93
Member

Ok. I would expect this to work in conjunction with flucoma/flucoma-max#32 because that deals with the SDK layout changes.

Testing against flucoma/flucoma-max#32 with:

cmake -GNinja -DHISS_PATH=../../hiss -DMAX_SDK_PATH=../sdk -DFLUID_PATH=../../core -DCMAKE_OSX_ARCHITECTURES="x86_64;arm64" ..

gives:

clang: error: the clang compiler does not support '-march=core2'
ninja: build stopped: subcommand failed.

@weefuzzy
Member Author

Gah. I see what I did wrong there. Try now?

@jamesb93
Member

CMake Error at /Users/james/dev/flucoma/core/script/flucoma_simdcmd.cmake:13:
Parse error. Expected a command name, got quoted argument with text
"amd64.|x86_64.|AMD64.|i686.|i386.|x86.".
Call Stack (most recent call first):
/Users/james/dev/flucoma/core/CMakeLists.txt:184 (include)

-- Configuring incomplete, errors occurred!
See also "/Users/james/dev/flucoma/max/build/CMakeFiles/CMakeOutput.log".
See also "/Users/james/dev/flucoma/max/build/CMakeFiles/CMakeError.log"

@weefuzzy
Member Author

🤦 and now?

@jamesb93
Member

Works to make a fat boi! 🍬

@jamesb93
Member

Just testing to see if the old SDK + building an Intel binary still works.

@jamesb93
Member

yes! woo - nice work @weefuzzy

@tremblap
Member

Sooooooo, my list of tests for this:

  1. does it compile on Intel Mac (Max+SC)
  2. (does it change anything on Intel Windows?)

Then we talked about performance-loss assessment (to see whether that is a real problem), and I took notes on that list a long time ago:
- tests on kr queries to see if it chokes:

  • large and small kdtree queries
  • test on umap with large and small datasets
  • test on my classic bufnmf

I expect a little loss, and that is fine, as we agreed. I'll also try with AVX512 to see if that gives any gain for realz.

Am I forgetting anything?

@AlexHarker
Contributor

It might be worth noting that the speed differential between SSE and AVX may depend in complex ways on the specifics of the architecture of the testing machine; in other words, knowing how much faster it is on one machine might not translate directly to another one.

The gains will also depend on how much of your code is likely to make direct use of AVX. The larger vector size only guarantees that equivalent operations can be done on more data with fewer CPU instructions; it may, however, be the case that the AVX instructions take more cycles, so the relationship to actual performance gains is complex even at this level (before you think about memory speed etc.).

If there are significant benefits to AVX (or AVX512) then I can talk through how you might do runtime selection in a manageable fashion, but it is not a simple issue at all: you have to compile different files with different compiler flags and create variants of the code for all relevant sections, which also bloats code size. The selection code itself must use only the lowest guaranteed instruction set. The simplest way of achieving all of this, which I've considered for my externals and other projects, would be to compile variants of the entire object and then just branch to the right one at initialisation, but that's least efficient in terms of the memory used to load the code.

A simpler option is to provide multiple compiles directly, but that won't play nice with the Max package manager etc.

@weefuzzy
Member Author

Thanks for those useful notes. I'd certainly need a great deal of convincing to sink time into runtime branching at this point 😆

I'm not expecting huge differences, tbh. We can probably make a decent stab at predicting whether it would make much difference at all by looking at the generated assembly. For instance, notionally the most expensive bit of querying a kd-tree is the distance calculations, but these may not get vectorised at all (and the cost of the calculation itself might be smaller than the time lost to cache misses).

So, I agree with the thrust of Alex's advice, as I read it, which is that doing a sanity check is wholly appropriate but there's no point reading too much into the detail of the results. For the future, there is much we could do to find out what the compiler is able to vectorise, and to change code to make its job easier.

@AlexHarker
Contributor

I really wouldn't expect auto-vectorisation on things like a kd-tree: the compiler is only likely to auto-vectorise fairly simple loops dealing with consistently laid-out memory, and you are unlikely to have much code of that nature, given Gerard's coding style. I have seen little in the codebase that I would expect to auto-vectorise.

Looking at the assembly is, IMHO, unlikely to tell you much at all unless you are willing to sit with reference manuals that tell you how many CPU cycles various instructions take on the kind of hardware you expect to run on; that's a very time-consuming pursuit.

The biggest wins I would expect are likely to be in large Eigen operations where the code is explicitly written with intrinsics (there's also some explicitly vectorised stuff in my code, but most of it affects only the transient objects, with the exception of the FFT on Windows; if you have something for which the FFT is a big cost, that would be worth testing). So: things with big matrix operations (there's a solver in the transient stuff, for instance, which is quite a slow part of the code). That might be useful in figuring out which objects are likely candidates for more detailed testing.

@AlexHarker
Contributor

Ah, sorry: the distance measure in KDTree is done using vectors in Eigen, so I would expect explicit vectorisation there. Testing it on datasets with large dimensionality, rather than low dimensionality and lots of data, is likely to expose whether that might translate into real results.

@tremblap
Member

I go into this with the open-minded expectation that losing 10-20% is worth it, especially if we make compiling with local optimisations on so straightforward (I'll try our instructions as part of this process). The idea of accessibility is very near my heart, and a beginner with an old machine having to compile to get around optimisation is not as inclusive as asking it the other way round.

@tremblap
Member

speedtest report:

  • nmf: 2400-2600; Max job takes 2500-2700 / 2.75-2.87; SC job takes 2.52-2.72 (faster in SC and slightly slower in Max)
  • kdtree: 16k items of 200 columns; fit and query in the same ballpark in Max (55ms and 7ms)
  • UMAP: help file but 500 iterations; same ballpark (1750-1850)
  • fluidtransients~: help file but order+window @ 100 (55% CPU in Activity Monitor)
  • sines: help file but 64 hop and 16364 FFT size (46% CPU in Activity Monitor)

This is sooooo similar that I wonder: is there a way to make sure I really didn't compile AVX? I sort of remember a command line that I could run on an external...
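Maybe something along these lines? (Just a sketch from memory: the external path is hypothetical, and this only looks for AVX's wider registers.)

```shell
# On macOS, dump the external's disassembly with otool, then grep it, e.g.:
#   otool -tv FluidKDTree.mxo/Contents/MacOS/FluidKDTree > disasm.txt
# AVX/AVX512 instructions use ymm/zmm registers; SSE-only code stops at xmm.
# Demo against a stand-in disassembly line:
printf 'vmulpd %%ymm0, %%ymm1, %%ymm2\n' > disasm.txt
if grep -qE 'ymm|zmm' disasm.txt; then echo "AVX present"; else echo "no AVX found"; fi
```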

@tremblap
Member

OK, after testing the same stuff in SC I again get a little less power, but it's quite insignificant. So I vote for merging.

@tremblap tremblap merged commit bfe4092 into flucoma:main Sep 20, 2021
@jamesb93
Member

WOO🪨

@weefuzzy weefuzzy deleted the fix/sse_by_default branch November 4, 2021 16:48