Skip to content

Conversation

@mattip
Copy link
Contributor

@mattip mattip commented Oct 5, 2020

closes gh-2709 by adding a call to fninit emms to clear the FPU registers.

This seems to be the safe path, unless it can be proven that the kernels will only be called from code that clears the FPU registers and does not set them.

edit: wrong issue number
edit: the original PR had fninit, as per review it was changed to emms

@mattip
Copy link
Contributor Author

mattip commented Oct 5, 2020

My sed foo on windows was not strong enough to consistently format the code. Do you have a preference for blank lines before/after/none?

@mattip
Copy link
Contributor Author

mattip commented Oct 5, 2020

xref numpy/numpy#16744

@carlkl
Copy link

carlkl commented Oct 6, 2020

each fninit should follow something like that:

#ifdef CONSISTENT_FPCSR
      __asm__ __volatile__ ("fnstcw %0"  : "=m" (queue -> x87_mode));
#endif

But I have no idea how to implement this directly in assembler code.

@martin-frbg
Copy link
Collaborator

Note that this one-liner is only the instruction to grab the current setting and store it somewhere in memory (to then change individual bits and put it back via fnldcw). Anyway I am more in favor of using the generic C functions on Windows unless somebody demonstrates that a similar problem exists on other operating systems. I suspect only a handful of the 27 files is
actually used by any reasonably recent TARGET, about half of them appear to be placeholders for the unimplemented quadruple precision mode (nrm2.S being the big exception)

@mattip
Copy link
Contributor Author

mattip commented Oct 6, 2020

Happy to close this in favor of an alternative. We may implement this (or a variant if numpy/numpy#16744 comes up with a NumPy-based solution) as a temporary patch for the upcoming NumPy release, pending a better fix in OpenBLAS or a OS fix from Microsoft. Pinging @charris for thoughts.

@charris
Copy link

charris commented Oct 6, 2020

@mattip I don't have the background to judge the proposed fixes, but I am optimistic that things will get fixed at some point. Thanks for tracking down the problem, that was a nice bit of detective work. We can put a short test in numpy/__init__.py that detects the problem and recommends an upgrade if Microsoft provides a fix, or MKL if they don't. I won't complain if OpenBLAS fixes the problem, but working around compiler/library problems eventually clutters the code. I'm pretty sure Microsoft will fix it at some point, it makes no sense to not fix a bug that appears to randomly manifest here or there, no one could trust their code after that.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Oct 6, 2020

Confirmed now that it is only (z)nrm2.S and (z)sum.S that are actually used by any x86_64 target. (Where the sum.S and zsum.S are trivial hacks of their respective asum files that i made in #2072 to implement the non-absolute xSUM extensions - the active xASUM versions are SSE2 and presumably use some shift trickery to kill the signs)

@martin-frbg
Copy link
Collaborator

Digging deeper, the CONSISTENT_FPCSR option does not need to manipulate individual bits in the fpu state, but what it does is
store the initial state seen when the thread server started, and push this to any newly created thread. Offhand I see no easy way to access what is queue->x87_mode in the context of blas_server_win32.c from where we are in nrm2.S. FNINIT would set the fpu mode to "round to nearest, all exceptions masked, 64bit precision" - which of these is incompatible with numpy expectations ?

@mattip
Copy link
Contributor Author

mattip commented Oct 11, 2020

Truthfully I am not sure. I don't think NumPy consciously sets the FPU mode. There are probably users who set various compilation flags to specify floating point accuracy/speed, and this may mess with their expectations. Maybe @charris or @bashtage have more insight.

The deeper we go into this rabbit hole, the more I think we should push Microsoft to fix the problem. Are the c-based loops that much less performant? Are there benchmarks that stress the [x]nrm2 loops, or could we write one?

@charris
Copy link

charris commented Oct 11, 2020

NumPy uses round to even. The precision probably doesn't matter in practice, but the masking could be problematic, I don't know how it currently operates.

@martin-frbg
Copy link
Collaborator

The C functions are probably not much worse than the assembly with an added FNINIT (hence my alternate PR) but carlkl pointed out that the MS-created problem had better be fixed "somewhere" rather than be left to blow up somebody else later. (OTOH i would really like to get the 0.3.11 release out, though it was not this issue in particular holding it up)

@mattip
Copy link
Contributor Author

mattip commented Oct 11, 2020

Maybe that is an easier ask from Microsoft: could we get a small piece of code we could run after fmod that would clean up whatever mess they are leaving behind? We do this in NumPy to clear divide-by-zero and other errors. I think it is just the ST[0] register, they would know better.

Edit: added this question to the microsoft developer community issue

@mattip
Copy link
Contributor Author

mattip commented Oct 11, 2020

hence my alternate PR

Which?

@mattip
Copy link
Contributor Author

mattip commented Oct 11, 2020

Ahh, got it. gh-2882

@martin-frbg
Copy link
Collaborator

I have merged "my" workaround for now but remain open to more efficient solutions.

@carlkl
Copy link

carlkl commented Oct 12, 2020

The finit instruction at the beginning of each assembler file breaks OpenBLAS CONSISTENT_FPCSR and it does not close #2709. The reason is that finit changes the fpu precision mode to extended mode and this would show up in the numpy tests. A similar problem is described here: numpy/numpy#9580 BUG: Add hypot and cabs functions to WIN32 blacklist.
A better solution is to add the emms instruction instead of finit.

@bashtage
Copy link

Too bad _mm_empty was removed in VS 2015 which issues emms.

See avaneev/avir#7 (comment) who were using it in a similar way it seems.

@martin-frbg
Copy link
Collaborator

Are we sure now that fnclex+emms fixes it ?

@mattip
Copy link
Contributor Author

mattip commented Oct 12, 2020

fnclex alone does not clear anything. fnclex + emms does clear the errors. Now trying just emms alone.

@martin-frbg
Copy link
Collaborator

Ah, good. From what I read about emms I do not expect it to clear the exception flag on its own though. And maybe this should go somewhere in the initialization code of driver/others/blas_server_win32.c rather than the few assembly kernels that use the fpu themselves ?

@mattip
Copy link
Contributor Author

mattip commented Oct 12, 2020

emms alone is enough. Should I update the PR to reflect that or close it?

@mattip
Copy link
Contributor Author

mattip commented Oct 12, 2020

If I could, I would put a call to emms in NumPy just after the call to fmod. I wonder if I can compile with mingw just that routine and link it with MSVC

@carlkl
Copy link

carlkl commented Oct 12, 2020

@mattip, thanks for testing. In the case emms is sufficient means only the stack was corrupted.

'emms' can be compiled with MASM:

.code
_mm_empty PROC
     push rbp
     mov rbp , rsp 
     fnclex
     emms 
     nop
     pop rbp
     ret
_mm_empty endp
end

_mm_empty_x64.zip

I added the C sources for mingw-w64 and assembler code for MASM with the object-files in the attachment.

@martin-frbg
Copy link
Collaborator

Update if you like, with an ifdef OS_WINDOWS around it (unless you can somehow put that emms into numpy, but OTOH some other half-MSVC-half-mingw code could create the same problem "in" OpenBLAS).

@mattip
Copy link
Contributor Author

mattip commented Oct 12, 2020

@carlkl thanks, I will give it a shot over at NumPy. I see my previous mistake was I missed ret.

I will update the PR.

@carlkl
Copy link

carlkl commented Oct 12, 2020

@martin-frbg, all WIN64 programs using UCRT and OpenBLAS at the same time could fail the same way. So I think it's a good idea. to make OpenBLAS robst against such failures. This could be addressed in blas_server_win32.c as well IMHO.

Something like this:

__asm__ __volatile__ ("emms");

@martin-frbg
Copy link
Collaborator

@carlkl thanks, the "how" is clear to me, but I am unsure about the "where" - probably the gotoblas_init constructor in memory.c rather than the blas_server that should only get involved in the multithreaded case but I bet there are some MS subtleties w.r.t initializations of dlls vs. static code.

@carlkl
Copy link

carlkl commented Oct 12, 2020

It should be enough to clean the stack once at the initialization of OpenBLAS. Maybe enclosed with something like that:

#if defined(OS_WINDOWS) && defined(__MINGW64__) && defined(__WIN64__)

@martin-frbg
Copy link
Collaborator

#2889 has that "global" approach now (though untested)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Errors when using OpenBLAS through NumPy on Windows 10 2004

6 participants