add fninit to reset fpu registers before assembler routines #2881

mattip · 2020-10-05T20:41:30Z

closes gh-2709 by adding a call to ~~fninit~~ emms to clear the FPU registers.

This seems to be the safe path, unless it can be proven that the kernels will only be called from code that clears the FPU registers and does not set them.

edit: wrong issue number
edit: the original PR had fninit, as per review it was changed to emms

mattip · 2020-10-05T20:42:43Z

My sed-fu on Windows was not strong enough to consistently format the code. Do you have a preference for blank lines before/after/none?

mattip · 2020-10-05T20:45:25Z

xref numpy/numpy#16744

carlkl · 2020-10-06T08:24:02Z

each fninit should follow something like that:

#ifdef CONSISTENT_FPCSR
      __asm__ __volatile__ ("fnstcw %0"  : "=m" (queue -> x87_mode));
#endif

But I have no idea how to implement this directly in assembler code.

martin-frbg · 2020-10-06T08:53:31Z

Note that this one-liner is only the instruction to grab the current setting and store it somewhere in memory (to then change individual bits and put it back via fldcw). Anyway, I am more in favor of using the generic C functions on Windows unless somebody demonstrates that a similar problem exists on other operating systems. I suspect only a handful of the 27 files are actually used by any reasonably recent TARGET; about half of them appear to be placeholders for the unimplemented quadruple precision mode (nrm2.S being the big exception).
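The save/modify/restore pattern described in this comment can be sketched in C with GCC-style inline assembly. This is only an illustration, not OpenBLAS code: the helper names (`x87_get_cw`, `x87_set_cw`, `x87_use_double_precision`) and the `HAVE_X87` macro are made up here, and the non-x86 branch is a stub so the sketch still compiles on other targets.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#define HAVE_X87 1
/* fnstcw: store the current x87 control word to memory. */
static uint16_t x87_get_cw(void) {
    uint16_t cw;
    __asm__ __volatile__("fnstcw %0" : "=m"(cw));
    return cw;
}
/* fldcw: load a control word from memory back into the FPU. */
static void x87_set_cw(uint16_t cw) {
    __asm__ __volatile__("fldcw %0" : : "m"(cw));
}
#else
#define HAVE_X87 0
/* Stub for targets without an x87 unit (assumption of this sketch). */
static uint16_t stub_cw = 0x037Fu;
static uint16_t x87_get_cw(void) { return stub_cw; }
static void x87_set_cw(uint16_t cw) { stub_cw = cw; }
#endif

/* Precision-control field of the control word: bits 8-9. */
#define X87_PC_MASK   0x0300u
#define X87_PC_DOUBLE 0x0200u /* 53-bit significand */

/* Switch to 53-bit precision and return the previous control word
 * so the caller can restore it afterwards. */
static uint16_t x87_use_double_precision(void) {
    uint16_t old = x87_get_cw();
    x87_set_cw((uint16_t)((old & ~X87_PC_MASK) | X87_PC_DOUBLE));
    return old;
}
```

A caller would keep the value returned by `x87_use_double_precision()` and pass it to `x87_set_cw()` once it is done, so only the precision bits change in between.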

mattip · 2020-10-06T08:59:39Z

Happy to close this in favor of an alternative. We may implement this (or a variant if numpy/numpy#16744 comes up with a NumPy-based solution) as a temporary patch for the upcoming NumPy release, pending a better fix in OpenBLAS or an OS fix from Microsoft. Pinging @charris for thoughts.

charris · 2020-10-06T13:53:31Z

@mattip I don't have the background to judge the proposed fixes, but I am optimistic that things will get fixed at some point. Thanks for tracking down the problem, that was a nice bit of detective work. We can put a short test in numpy/__init__.py that detects the problem and recommends an upgrade if Microsoft provides a fix, or MKL if they don't. I won't complain if OpenBLAS fixes the problem, but working around compiler/library problems eventually clutters the code. I'm pretty sure Microsoft will fix it at some point; it makes no sense not to fix a bug that appears to manifest randomly here and there, since no one could trust their code after that.

martin-frbg · 2020-10-06T17:11:42Z

Confirmed now that it is only (z)nrm2.S and (z)sum.S that are actually used by any x86_64 target. (The sum.S and zsum.S files are trivial hacks of their respective asum files that I made in #2072 to implement the non-absolute xSUM extensions; the active xASUM versions are SSE2 and presumably use some shift trickery to kill the signs.)

martin-frbg · 2020-10-11T17:46:00Z

Digging deeper, the CONSISTENT_FPCSR option does not need to manipulate individual bits in the fpu state; what it does is store the initial state seen when the thread server started, and push this to any newly created thread. Offhand I see no easy way to access what is queue->x87_mode in the context of blas_server_win32.c from where we are in nrm2.S. FNINIT would set the fpu mode to "round to nearest, all exceptions masked, 64bit precision" - which of these is incompatible with numpy expectations?
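The FNINIT defaults quoted here correspond to the control word value 0x037F. A minimal sketch that executes fninit and decodes the resulting control word is shown below; the struct and function names are hypothetical, and the non-x86 branch simply returns the documented reset value so the sketch compiles elsewhere.

```c
#include <stdint.h>

/* Decoded view of the x87 control word (hypothetical helper type). */
struct x87_mode {
    unsigned exception_masks; /* bits 0-5, a set bit masks that exception */
    unsigned precision_bits;  /* significand width selected by bits 8-9 */
    unsigned rounding;        /* bits 10-11, 0 = round to nearest (even) */
};

static struct x87_mode x87_decode_cw(uint16_t cw) {
    /* Precision-control encodings: 00b = 24, 01b = reserved, 10b = 53, 11b = 64 */
    static const unsigned widths[4] = {24, 0, 53, 64};
    struct x87_mode m;
    m.exception_masks = cw & 0x3Fu;
    m.precision_bits = widths[(cw >> 8) & 3u];
    m.rounding = (cw >> 10) & 3u;
    return m;
}

/* Reinitialize the FPU and report the control word it ends up with.
 * Note that fninit also empties the register stack and clears the
 * pending exception flags. */
static uint16_t x87_cw_after_fninit(void) {
#if defined(__x86_64__) || defined(__i386__)
    uint16_t cw;
    __asm__ __volatile__("fninit\n\tfnstcw %0" : "=m"(cw));
    return cw;
#else
    return 0x037Fu; /* documented reset value; no x87 on this target */
#endif
}
```

Decoding the result gives all six exception masks set, a 64-bit significand, and rounding mode 0, which matches the description in the comment.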

mattip · 2020-10-11T17:58:02Z

Truthfully I am not sure. I don't think NumPy consciously sets the FPU mode. There are probably users who set various compilation flags to specify floating point accuracy/speed, and this may mess with their expectations. Maybe @charris or @bashtage have more insight.

The deeper we go into this rabbit hole, the more I think we should push Microsoft to fix the problem. Are the c-based loops that much less performant? Are there benchmarks that stress the [x]nrm2 loops, or could we write one?

charris · 2020-10-11T18:07:28Z

NumPy uses round to even. The precision probably doesn't matter in practice, but the masking could be problematic; I don't know how it currently operates.

martin-frbg · 2020-10-11T18:09:23Z

The C functions are probably not much worse than the assembly with an added FNINIT (hence my alternate PR), but carlkl pointed out that the MS-created problem had better be fixed "somewhere" rather than be left to blow up somebody else later. (OTOH I would really like to get the 0.3.11 release out, though it was not this issue in particular holding it up.)

mattip · 2020-10-11T18:15:32Z

Maybe that is an easier ask from Microsoft: could we get a small piece of code we could run after fmod that would clean up whatever mess they are leaving behind? We do this in NumPy to clear divide-by-zero and other errors. I think it is just the ST[0] register, they would know better.

Edit: added this question to the Microsoft Developer Community issue

mattip · 2020-10-11T18:42:54Z

> hence my alternate PR

Which?

mattip · 2020-10-11T18:49:24Z

Ahh, got it. gh-2882

martin-frbg · 2020-10-11T20:24:15Z

I have merged "my" workaround for now but remain open to more efficient solutions.

carlkl · 2020-10-12T09:19:55Z

The finit instruction at the beginning of each assembler file breaks OpenBLAS CONSISTENT_FPCSR and it does not close #2709. The reason is that finit changes the fpu precision mode to extended mode and this would show up in the numpy tests. A similar problem is described here: numpy/numpy#9580 BUG: Add hypot and cabs functions to WIN32 blacklist.
A better solution is to add the emms instruction instead of finit.

bashtage · 2020-10-12T09:26:28Z

Too bad _mm_empty, which issues emms, was removed in VS 2015.

See avaneev/avir#7 (comment), where it seems to have been used in a similar way.

martin-frbg · 2020-10-12T11:49:05Z

Are we sure now that fnclex+emms fixes it?

mattip · 2020-10-12T14:26:24Z

fnclex alone does not clear anything. fnclex + emms does clear the errors. Now trying just emms alone.

martin-frbg · 2020-10-12T14:49:47Z

Ah, good. From what I read about emms I do not expect it to clear the exception flag on its own though. And maybe this should go somewhere in the initialization code of driver/others/blas_server_win32.c rather than the few assembly kernels that use the fpu themselves?

mattip · 2020-10-12T14:58:44Z

emms alone is enough. Should I update the PR to reflect that or close it?

mattip · 2020-10-12T15:01:11Z

If I could, I would put a call to emms in NumPy just after the call to fmod. I wonder if I can compile just that routine with mingw and link it with MSVC.

carlkl · 2020-10-12T15:09:12Z

@mattip, thanks for testing. If emms alone is sufficient, that means only the stack was corrupted.

'emms' can be compiled with MASM:

.code
_mm_empty PROC
    push rbp
    mov rbp, rsp
    fnclex
    emms
    nop
    pop rbp
    ret
_mm_empty endp
end

_mm_empty_x64.zip

I added the C sources for mingw-w64 and assembler code for MASM with the object-files in the attachment.
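To illustrate why a non-empty x87 register stack breaks later FPU code and why emms is enough to recover, here is a minimal sketch for x86_64 with GCC-style inline assembly. It fills all eight registers by hand as a stand-in for whatever leaves them dirty in the real bug; that stand-in, the helper names and the `HAVE_X87_64` macro are assumptions of this sketch. It also assumes the invalid-operation exception is masked, which is the usual default.

```c
#if defined(__x86_64__)
#define HAVE_X87_64 1
/* Push eight values without popping, so every x87 register is tagged
 * as in use (the "dirty" state a kernel might inherit). */
static void x87_fill_stack(void) {
    __asm__ __volatile__(
        "fldz\n\tfldz\n\tfldz\n\tfldz\n\t"
        "fldz\n\tfldz\n\tfldz\n\tfldz");
}
/* What an assembler kernel might do: load 1.0, then pop it to memory.
 * With a full stack the load overflows and yields a NaN instead. */
static double x87_load_one(void) {
    double out;
    __asm__ __volatile__("fld1\n\tfstpl %0" : "=m"(out));
    return out;
}
/* emms marks all x87 registers as empty again. */
static void x87_emms(void) {
    __asm__ __volatile__("emms");
}
#else
#define HAVE_X87_64 0
/* Stubs so the sketch compiles on other targets. */
static void x87_fill_stack(void) {}
static double x87_load_one(void) { return 1.0; }
static void x87_emms(void) {}
#endif
```

Calling `x87_fill_stack()` followed by `x87_load_one()` returns a NaN on x86_64, while the same load after `x87_emms()` returns 1.0 again. The sticky exception flags in the status word are left set in this sketch, which is what the extra fnclex in the MASM routine above would clear.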

martin-frbg · 2020-10-12T15:09:41Z

Update if you like, with an ifdef OS_WINDOWS around it (unless you can somehow put that emms into numpy, but OTOH some other half-MSVC-half-mingw code could create the same problem "in" OpenBLAS).

mattip · 2020-10-12T15:13:18Z

@carlkl thanks, I will give it a shot over at NumPy. I see my previous mistake was that I left out the ret.

I will update the PR.

carlkl · 2020-10-12T15:19:39Z

@martin-frbg, all WIN64 programs using UCRT and OpenBLAS at the same time could fail the same way. So I think it's a good idea to make OpenBLAS robust against such failures. This could be addressed in blas_server_win32.c as well IMHO.

Something like this:

__asm__ __volatile__ ("emms");

martin-frbg · 2020-10-12T15:51:24Z

@carlkl thanks, the "how" is clear to me, but I am unsure about the "where". Probably the gotoblas_init constructor in memory.c rather than the blas_server, which should only get involved in the multithreaded case, but I bet there are some MS subtleties w.r.t. initialization of DLLs vs. static code.

carlkl · 2020-10-12T16:07:42Z

It should be enough to clean the stack once at the initialization of OpenBLAS. Maybe enclosed in something like this:

#if defined(OS_WINDOWS) && defined(__MINGW64__) && defined(__WIN64__)
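A rough sketch of the "clean once at initialization" idea, using a GCC/Clang constructor function, might look like the following. The names `hypothetical_library_init` and `fpu_reset_done` are invented for illustration and this is not the code that went into OpenBLAS. The sketch guards on `__x86_64__` so that it does something on any x86_64 build; the proposal in this thread would use the Windows/MinGW guard shown above instead.

```c
/* Set once the init-time cleanup has run (illustrative flag only). */
static int fpu_reset_done = 0;

/* Mark all x87 registers as empty. Compiled to a no-op on targets
 * other than x86_64, where emms is not available. */
static void reset_fpu_stack(void) {
#if defined(__x86_64__)
    __asm__ __volatile__("emms");
#endif
}

/* Runs automatically when the program or shared library is loaded,
 * before main() or any exported routine can be called. */
__attribute__((constructor))
static void hypothetical_library_init(void) {
    reset_fpu_stack();
    fpu_reset_done = 1;
}
```

Running the cleanup in a constructor keeps it out of the individual kernels, at the cost of only protecting against state that was already dirty when the library was loaded.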

martin-frbg · 2020-10-12T17:08:09Z

#2889 has that "global" approach now (though untested)

add fninit to reset fpu registers before assembler routines

a5b1649

use emms instead, add WIN guards

403eb51

martin-frbg mentioned this pull request Oct 12, 2020

Reset the fpu stack during OpenBLAS initialization when on Windows #2889

Closed

martin-frbg added this to the 0.3.11 milestone Oct 12, 2020

martin-frbg merged commit 0c84ffe into OpenMathLib:develop Oct 12, 2020

ViralBShah mentioned this pull request Sep 26, 2021

Exception access violation / openblas / Julia 1.6.1 JuliaLang/LinearAlgebra.jl#854

Closed

Rolafawaz approved these changes Dec 11, 2023



6 participants
