Switch to using SSE by default for Intel builds, rather than AVX #28
@weefuzzy Would those defaults be activated by passing my own flags to FLUID_ARCH? I am not explicitly modifying any other parts of the build process apart from the OSX_ARCHITECTURES flag. Is there a way to verify whether I was using said SIMD ops?
They would only activate if nothing was passed in
If you're building with Ninja or Make, you can get CMake to produce a compile commands JSON file, then grep for the relevant flags. In Xcode, you can get to the raw compile commands through the report navigator (speech bubble icon).
What are the relevant flags I should grep for?
I couldn't find anything.
Those compile commands are all x64 builds. You'd probably need to delete your CMake cache or whole build tree to reset to doing ARM or both. Although, worryingly, I didn't see either AVX or SSE in there (perhaps your FLUID_ARCH is already in the cache as being blank?)
It will take a mo ⏲️ but I will iterate through all the possible combos of flags, cleaning between each :)
This does not work if I don't pass FLUID_ARCH. It whinges about not being able to build for armv7.
Presumably because it's automatically downloading HISSTools. What happens with `cmake -GNinja -DMAX_SDK_PATH=../sdk -DHISS_PATH=<some local and up-to-date HISSTools clone> ..`?
This fails:
but this works:
What's the reason though? (and I'm confused about the If the HISSTools in
I guess I'm a little lost too 🪘
Perhaps you could draw up a list of combinations that you want me to run, because I'm getting bogged down in dependencies and flags and what is meaningful in helping push this side of things forward.
Ok. One ambition is to make sure building works without passing FLUID_ARCH at all. So, in a nice world, all of the following would work:
- passing nothing: I would expect either an arm or a universal binary
- passing arm64: I would expect just arm64
- passing x86_64: I would expect just x64

If any of them fail, post the error messages and we'll take it from there
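Concretely, the three cases might look something like this (a sketch: the generator and `MAX_SDK_PATH` value are taken from commands quoted earlier in the thread, and `CMAKE_OSX_ARCHITECTURES` is the standard CMake variable for selecting macOS target architectures):

```shell
# 1. Nothing passed: expect a native arm or universal binary
cmake -GNinja -DMAX_SDK_PATH=../sdk ..
# 2. arm64 only
cmake -GNinja -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES=arm64 ..
# 3. x86_64 only
cmake -GNinja -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES=x86_64 ..
```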
And just to clarify: which SDK would this point to in each example? AFAIK you cannot make a fat binary with the old SDK.
8.2 in all cases. I'm happy enough that we can build < 8.2 on x64
error1.txt None of these work and I think it's because of the lack of FLUID_ARCH. They correspond to the commands you wrote out.
Only indirectly. For the first two, it doesn't like the flags that I plonked in (for R-PI maybe). For the third, it's still picking up ARM in the platform test, so we'll need to find a better platform test
Anyway, thanks for looking at this. I'll borrow my beloved's m1 in due course and have a crack at fixing all this properly |
Feel free to bug me with more things to try, and sorry for the confusion somewhere in the middle there 😭
In case it's helpful (and not obvious), I believe that arm64 is the correct thing to pass for ARM M1 builds (that is what is happening in the recent work I did on M1-ifying some externals through c74)
Cheers. Yes, it seemed a bit much to expect whatever I'd pasted from somewhere else for R-PI or Bela or something to Just Work 😆 Having skimmed the CMake issues list, it looks like there are possible additional complications depending on whether one is running the Intel or ARM build of CMake on Apple Silicon, and that relying on the processor type isn't going to be useful for managing this, so we'll maybe need a different set of conditions for Apple. Now I have access to a Shiny Thing, I can mash the keyboard until it works.
The machine you are compiling on is irrelevant. The only relevant thing is your version of Xcode, but older versions simply ignore the arm64 architecture flag. Given that 32-bit Max is no longer a thing, that produces acceptable results in all circumstances for the versions of Max you are targeting - no?
(assuming you are passing both flags and making a UB2)
Thanks – Xcode and CMake don't behave the same here (from what I gathered earlier); the issue isn't the machine – different, and not wholly predictable, host processors were getting reported depending on whether CMake was running under Rosetta or not. It's also not clear that one will get a UB2 by default without manually setting the flags, the upshot being that I need more intelligent probing CMake code than I currently have.
Well, my suggestion is that you manually set the flags in all cases to x86_64 arm64 (a UB2) - what is the case that you don't think this will satisfy? In other words, I am unclear as to why you need to know the processor of the machine you are building on. The default for macOS debug builds is native architecture only, so if you want to avoid building both in that circumstance you need to target configurations, but you still wouldn't need to know the architecture of the build machine.
They came from somewhere. Maybe it was Bela, not R-PI 🤷 Anyway, the happy news seems to be that I might be able to get rid of some complexity in the build by ditching this whole thing.
@jamesb93 I've now made it so that the silly things shouldn't happen on arm64. So it should work with nothing in FLUID_ARCH. I have high hopes for:
I get this error atm. I've looked at the

CMake Error at /Users/james/dev/flucoma/core/script/flucoma_simdcmd.cmake:27 (endif):
  Flow control statements are not properly nested.
Call Stack (most recent call first):
  /Users/james/dev/flucoma/core/CMakeLists.txt:184 (include)

EDIT: here was my build command:
Whoops. Extra endif()
I can build an intel 64-bit binary that works under Rosetta:
This is with the 8.0.3 Max SDK. Building a fat binary with the 8.0.3 SDK doesn't work, and when I point to the beta 8.2 I get this error: with the following command:
So I assume it's not accounting for the change in structure of the SDK (yet?).
Ok. I would expect this to work in conjunction with flucoma/flucoma-max#32 because that deals with the SDK layout changes. |
Testing against flucoma/flucoma-max#32 with:
gives:
Gah. I see what I did wrong there. Try now? |
CMake Error at /Users/james/dev/flucoma/core/script/flucoma_simdcmd.cmake:13:
-- Configuring incomplete, errors occurred!
🤦 and now? |
Works to make a fat boi! 🍬
Just testing to see if the old SDK + building an intel binary still works
yes! woo - nice work @weefuzzy
Sooooooo my list of tests for this:
Then we talked about performance loss assessment (to see if that is a real problem), and I had taken notes of that list a long time ago:
I expect a little loss and that is fine, as we agreed. I'll also try with AVX512 to see if that gives any gain for realz. Am I forgetting anything?
It might be worth noting that the speed differential between SSE and AVX may have complex links to the specifics of the architecture of the testing machine - in other words, if you know how much faster it is on one machine, that might not translate directly to another one. The gains will also depend on how much of your code is likely to make direct use of AVX. The larger vector size only guarantees that equivalent operations can be done on more data with fewer CPU instructions; it may, however, be the case that the AVX instructions take more cycles, and hence the relationship to actual performance gains is complex on these levels (before you even think about memory speed etc.)

If there are significant benefits to AVX (or AVX512) then I can talk through how you might do runtime selection in a manageable fashion, but it is not a simple issue at all, as you have to compile different files with different compiler flags and create variants of the code for all relevant sections, which also bloats code size. The selection code must also use only the lowest guaranteed instructions. The simplest way of achieving all of this, which I've considered for my externals/other projects, would be to compile variants of the entire object and then just branch to the right one at initialisation, but that's least efficient in terms of the memory used to load the code. A simpler option is to provide multiple compiles directly, but that won't play nice with the Max package manager etc.
Thanks for those useful notes. I'd certainly need a great deal of convincing to sink time into runtime branching at this point 😆 I'm not expecting huge differences, tbh. We can probably make a decent stab at predicting whether it would make much difference at all by looking at the generated assembly. For instance, notionally the most expensive bit about querying a KD tree is the distance calculations, but these may not get vectorised at all (and the cost of the calculation itself might be smaller than the time lost to cache misses). So I agree with the thrust of Alex's advice, as I read it: doing a sanity check is wholly appropriate, but there's no point reading too much into the detail of the results. For the future, there is much we could do to find out what the compiler is able to vectorise, and to change code to make its job easier.
I really wouldn't expect auto-vectorisation on things like a kd-tree - the compiler is only likely to auto-vectorise fairly simple loops dealing with consistently laid-out memory - you are unlikely to have much code of that nature, given Gerard's coding style. I have seen little in the codebase that I would expect to auto-vectorise. Looking at the assembly is IMHO unlikely to tell you much at all, unless you are willing to sit with some reference manuals that tell you how many CPU cycles various instructions take on the kind of hardware you expect to run on - that's a very time-consuming pursuit.

The biggest wins I would expect would likely be in large Eigen operations where the code is explicitly written with intrinsics (there's also some explicitly vectorised stuff in my code, but most of it affects only the transient objects, with the exception of the FFT on Windows - if you have something for which the FFT is a big cost then that would be worth testing). So, things with big matrix operations (there's a solver in the transient stuff, for instance, which is quite a slow part of the code). That might be useful in figuring out which objects are likely to be candidates for more detailed testing.
Ah sorry - the distance measure in KDTree is done using vectors in Eigen, so I would expect explicit vectorisation there. Testing it on datasets with large dimensionality, rather than low dimensionality and lots of data, is likely to expose whether that might translate into real results.
I go into this with the open-minded expectation that losing 10-20% is worth it, especially if we make compiling with local optimisations so straightforward (I'll try our instructions as part of this process). The idea of accessibility is very near to my heart, and a beginner with an old machine having to compile to get around the optimisations is not as inclusive as asking the other way round.
speedtest report:
- nmf: 2400-2600; Max job takes 2500-2700 / 2.75-2.87, SC job takes 2.52-2.72 (faster in SC and slightly slower in Max)
- kdtree: 16k items of 200 cols, fit and query same ballpark in Max (55ms and 7ms)
- UMAP: help file but 500 iterations, same ballpark (1750-1850)
- fluidtransients~: help but order+window @100 (55% CPU in AM)
- sines: help but 64 hop and 16364 fft size (46% CPU in AM)
ok, after testing the same stuff in SC I again get a little less power, but quite insignificant. So I vote for merging
WOO🪨 |
As discussed. Between folk with new M1 machines and folk with older machines, having AVX on by default creates problems out of proportion to the benefits. This change uses SSE by default instead on x64/x86, and people can still set their own preferred options using FLUID_ARCH.

@jamesb93 There are SIMD options for ARM by default, but TBH I don't know how sensible they are. Are you using these on your M1 builds?
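The default-selection logic described here might look something like this sketch (variable handling, flag choices, and the platform tests are assumptions for illustration, not the actual contents of flucoma_simdcmd.cmake):

```cmake
# If the user supplied FLUID_ARCH, respect it; otherwise default to an
# SSE baseline on Intel instead of AVX. (Sketch only.)
if(NOT DEFINED FLUID_ARCH)
  if(APPLE AND CMAKE_OSX_ARCHITECTURES MATCHES "arm64")
    set(FLUID_ARCH "") # compiler defaults are fine for arm64
  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "x86|AMD64|amd64")
    set(FLUID_ARCH -msse -msse2) # SSE baseline rather than -mavx
  endif()
endif()
add_compile_options(${FLUID_ARCH})
```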