add fninit to reset fpu registers before assembler routines #2881
|
My sed foo on windows was not strong enough to consistently format the code. Do you have a preference for blank lines before/after/none? |
|
xref numpy/numpy#16744 |
|
each fninit should follow something like that:
#ifdef CONSISTENT_FPCSR
__asm__ __volatile__ ("fnstcw %0" : "=m" (queue -> x87_mode));
#endif
But I have no idea how to implement this directly in assembler code. |
|
Note that this one-liner is only the instruction to grab the current setting and store it somewhere in memory (to then change individual bits and put it back via fnldcw). Anyway I am more in favor of using the generic C functions on Windows unless somebody demonstrates that a similar problem exists on other operating systems. I suspect only a handful of the 27 files is |
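The read, modify, write-back sequence described above can be sketched in C with GCC-style inline assembly. This is only an illustration, not OpenBLAS code: the helper names `x87_get_cw`, `x87_set_cw` and `x87_set_precision` are hypothetical, the load instruction is spelled `fldcw` in GNU assembler syntax, and on non-x86 builds the sketch falls back to an emulated variable so that it still compiles.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_HAVE_X87 1
#else
#define SKETCH_HAVE_X87 0
/* Assumption: emulate the control word on targets without an x87 unit. */
static uint16_t sketch_emulated_cw = 0x037F;
#endif

/* Hypothetical helper: store the current x87 control word to memory. */
uint16_t x87_get_cw(void)
{
#if SKETCH_HAVE_X87
    uint16_t cw;
    __asm__ __volatile__ ("fnstcw %0" : "=m" (cw));
    return cw;
#else
    return sketch_emulated_cw;
#endif
}

/* Hypothetical helper: load a control word from memory. */
void x87_set_cw(uint16_t cw)
{
#if SKETCH_HAVE_X87
    __asm__ __volatile__ ("fldcw %0" : : "m" (cw));
#else
    sketch_emulated_cw = cw;
#endif
}

/* Replace only the precision-control field (bits 8-9) and return the
 * previous control word so that the caller can restore it later. */
uint16_t x87_set_precision(unsigned pc)
{
    uint16_t old = x87_get_cw();
    uint16_t cw = (uint16_t)((old & ~0x0300u) | ((pc & 0x3u) << 8));
    x87_set_cw(cw);
    return old;
}
```

A caller would keep the value returned by `x87_set_precision` and pass it back to `x87_set_cw` once the sensitive computation is done, which leaves the exception masks and the rounding field untouched.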
|
Happy to close this in favor of an alternative. We may implement this (or a variant if numpy/numpy#16744 comes up with a NumPy-based solution) as a temporary patch for the upcoming NumPy release, pending a better fix in OpenBLAS or an OS fix from Microsoft. Pinging @charris for thoughts. |
|
@mattip I don't have the background to judge the proposed fixes, but I am optimistic that things will get fixed at some point. Thanks for tracking down the problem, that was a nice bit of detective work. We can put a short test in |
|
Confirmed now that it is only (z)nrm2.S and (z)sum.S that are actually used by any x86_64 target. (Where the sum.S and zsum.S are trivial hacks of their respective asum files that I made in #2072 to implement the non-absolute xSUM extensions - the active xASUM versions are SSE2 and presumably use some shift trickery to kill the signs) |
|
Digging deeper, the CONSISTENT_FPCSR option does not need to manipulate individual bits in the fpu state, but what it does is |
|
Truthfully I am not sure. I don't think NumPy consciously sets the FPU mode. There are probably users who set various compilation flags to specify floating point accuracy/speed, and this may mess with their expectations. Maybe @charris or @bashtage have more insight. The deeper we go into this rabbit hole, the more I think we should push Microsoft to fix the problem. Are the c-based loops that much less performant? Are there benchmarks that stress the [x]nrm2 loops, or could we write one? |
|
NumPy uses round to even. The precision probably doesn't matter in practice, but the masking could be problematic, I don't know how it currently operates. |
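For readers unfamiliar with the three settings mentioned here (rounding, precision, exception masking), the following sketch decodes them from a 16-bit x87 control word. The struct and function names are hypothetical. The bit positions follow the x87 layout: exception masks in bits 0-5, precision control in bits 8-9, rounding control in bits 10-11. The value 0x037F used in the comment is the state that `fninit` establishes.

```c
#include <stdint.h>

/* Hypothetical container for the decoded control-word fields. */
struct x87_cw_fields {
    unsigned exception_masks; /* bits 0-5: 1 means the exception is masked */
    unsigned precision;       /* bits 8-9: 0 = 24-bit, 2 = 53-bit, 3 = 64-bit */
    unsigned rounding;        /* bits 10-11: 0 = nearest (ties to even),
                                 1 = down, 2 = up, 3 = toward zero */
};

/* Split a control word such as 0x037F (the fninit default) into fields. */
struct x87_cw_fields x87_decode_cw(uint16_t cw)
{
    struct x87_cw_fields f;
    f.exception_masks = cw & 0x3Fu;
    f.precision = (cw >> 8) & 0x3u;
    f.rounding = (cw >> 10) & 0x3u;
    return f;
}
```

With this decoding, a rounding field of 0 corresponds to the round-to-even behaviour mentioned above, and an exception-mask field of 0x3F means that all six x87 exceptions are masked.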
|
The C functions are probably not much worse than the assembly with an added FNINIT (hence my alternate PR) but carlkl pointed out that the MS-created problem had better be fixed "somewhere" rather than be left to blow up somebody else later. (OTOH I would really like to get the 0.3.11 release out, though it was not this issue in particular holding it up) |
|
Maybe that is an easier ask from Microsoft: could we get a small piece of code we could run after
Edit: added this question to the Microsoft developer community issue
Which? |
|
Ahh, got it. gh-2882 |
|
I have merged "my" workaround for now but remain open to more efficient solutions. |
|
The |
|
Too bad See avaneev/avir#7 (comment) who were using it in a similar way it seems. |
|
Are we sure now that fnclex+emms fixes it? |
|
|
|
Ah, good. From what I read about |
|
|
|
If I could, I would put a call to |
|
@mattip, thanks for testing. In the case 'emms' can be compiled with MASM:
.code
_mm_empty PROC
    push rbp
    mov rbp, rsp
    fnclex
    emms
    nop
    pop rbp
    ret
_mm_empty endp
end
I added the C sources for mingw-w64 and assembler code for MASM with the object-files in the attachment. |
|
Update if you like, with an |
|
@carlkl thanks, I will give it a shot over at NumPy. I see my previous mistake was I missed I will update the PR. |
|
@martin-frbg, all WIN64 programs using UCRT and OpenBLAS at the same time could fail the same way. So I think it's a good idea to make OpenBLAS robust against such failures. This could be addressed in Something like this:
__asm__ __volatile__ ("emms"); |
|
@carlkl thanks, the "how" is clear to me, but I am unsure about the "where" - probably the gotoblas_init constructor in memory.c rather than the blas_server that should only get involved in the multithreaded case but I bet there are some MS subtleties w.r.t initializations of dlls vs. static code. |
|
It should be enough to clean the stack once at the initialization of OpenBLAS. Maybe enclosed with something like that:
#if defined(OS_WINDOWS) && defined(__MINGW64__) && defined(__WIN64__) |
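Put together, the "reset once at initialization, only on 64-bit MinGW Windows builds" idea could look roughly like the sketch below. It is an illustration under assumptions, not the code that was merged: `fpu_reset_state` and `library_init_once` are hypothetical names, `OS_WINDOWS` is assumed to be defined by the build system as in the guard quoted above, and the `fnclex` plus `emms` pair mirrors the MASM routine posted earlier in this thread.

```c
/* Hypothetical helper, not an OpenBLAS function. On other platforms the
 * guard compiles the body away and the call does nothing. */
void fpu_reset_state(void)
{
#if defined(OS_WINDOWS) && defined(__MINGW64__) && defined(__WIN64__)
    /* fnclex clears pending x87 exception flags without checking them. */
    __asm__ __volatile__ ("fnclex");
    /* emms marks every x87/MMX register as empty. */
    __asm__ __volatile__ ("emms");
#endif
}

/* Set after the first call so that the reset runs only once. */
static int sketch_initialized = 0;

/* Hypothetical stand-in for a library constructor such as gotoblas_init.
 * Returns 1 if this call performed the initialization, 0 otherwise.
 * Note: this simple flag is not thread-safe. */
int library_init_once(void)
{
    if (sketch_initialized) {
        return 0;
    }
    fpu_reset_state();
    sketch_initialized = 1;
    return 1;
}
```

Doing the reset in a one-time initializer keeps it out of the hot path of every kernel call, at the cost of not protecting against FPU state that is disturbed again after the library has been loaded.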
|
#2889 has that "global" approach now (though untested) |
closes gh-2709 by adding a call to emms (originally fninit) to clear the FPU registers.
This seems to be the safe path, unless it can be proven that the kernels will only be called from code that clears the FPU registers and does not set them.
edit: wrong issue number
edit: the original PR had fninit, as per review it was changed to emms