This proposal extends #23057 to additionally cover intrinsics for the scalar forms of the x86 hardware instructions.
Rationale
Currently, CoreFX implicitly emits various SSE/SSE2 instructions when performing operations on scalar floating-point types. However, there is no way to explicitly emit specific optimized sequences of code to best take advantage of the underlying hardware.
For example, the System.Math and System.MathF APIs are currently implemented as FCALLs to the underlying C Runtime implementation of the corresponding method. The underlying C Runtime implementations are most frequently implemented as hand-optimized assembly or C/C++ intrinsics to ensure that they provide the best throughput possible and that they take advantage of newer instruction sets when available (cos/cosf, for example, frequently have one code path for SSE and one for FMA3).
Due to these methods depending on the underlying C Runtime implementation:
- We are at a significantly delayed cadence for getting bug fixes
- We often have to add hacks to work around these bugs as we find them, decreasing perf
- The implementations between platforms and architectures often differ
- Updating these methods requires modifying the runtime
By providing scalar intrinsics for the Intel hardware functions, it becomes much easier to implement these functions in managed code, which:
- Means fewer workarounds as bugs can be fixed directly
- Keeps perf more consistent
- Keeps input/output differences minimal or non-existent between operating systems
- Helps keep input/output differences minimal between platforms
- Allows most bug fixes to be made independent of the runtime
Furthermore, with the addition of #23057, it may become more pertinent to also have these scalar intrinsics to ensure that the codegen when intermixing scalar and vectorized operations remains "optimal" for the end user's customized/hand-optimized algorithms.
Proposed API
The current design in #23057 creates a class per instruction set and exposes methods such as Vector128<double> Sse2.Sqrt(Vector128<double>), which corresponds to the __m128d _mm_sqrt_pd(__m128d) C/C++ intrinsic.
This would additionally extend the surface area to expose the scalar forms of all the instructions with the same name as the vector intrinsic, but with a Scalar postfix. This is required to differentiate between the vector and scalar APIs due to them taking the same types as their inputs.
For example, we would expose:
public static class Sse2
{
    // __m128d _mm_add_sd(__m128d a, __m128d b);
    public static Vector128<double> AddScalar(Vector128<double> left, Vector128<double> right);
    // ...
    // No corresponding C/C++ intrinsic, used when upper should be taken from `value`
    public static Vector128<double> SqrtScalar(Vector128<double> value);
    // __m128d _mm_sqrt_sd(__m128d a, __m128d b);
    public static Vector128<double> SqrtScalar(Vector128<double> upper, Vector128<double> value);
    // ...
}
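To make the lane semantics concrete, the C/C++ intrinsics named in the comments above can be exercised directly. The following is a minimal sketch, assuming an x86/x64 compiler that provides `<emmintrin.h>`. The `lanes` struct and the helper names (`unpack`, `add_packed`, `add_scalar`, `sqrt_scalar`, `sqrt_scalar_upper`) are hypothetical and exist only for illustration; they are not part of the proposed API.

```c
#include <emmintrin.h> /* SSE2 intrinsics: _mm_add_pd, _mm_add_sd, _mm_sqrt_sd */

/* Hypothetical helper type so a result can be inspected lane by lane. */
typedef struct { double lo, hi; } lanes;

static lanes unpack(__m128d v) {
    double out[2];
    _mm_storeu_pd(out, v); /* out[0] = lower lane, out[1] = upper lane */
    lanes r = { out[0], out[1] };
    return r;
}

/* Packed form (_mm_add_pd): both lanes are added. */
static lanes add_packed(double a_lo, double a_hi, double b_lo, double b_hi) {
    /* _mm_set_pd takes the upper lane first, then the lower lane. */
    return unpack(_mm_add_pd(_mm_set_pd(a_hi, a_lo), _mm_set_pd(b_hi, b_lo)));
}

/* Scalar form (_mm_add_sd): only the lower lanes are added;
   the upper lane of the result is copied from the first operand. */
static lanes add_scalar(double a_lo, double a_hi, double b_lo, double b_hi) {
    return unpack(_mm_add_sd(_mm_set_pd(a_hi, a_lo), _mm_set_pd(b_hi, b_lo)));
}

/* Two-operand scalar sqrt (_mm_sqrt_sd): lower lane = sqrt(value.lo),
   upper lane is copied from `upper`. */
static lanes sqrt_scalar_upper(double upper_lo, double upper_hi,
                               double value_lo, double value_hi) {
    return unpack(_mm_sqrt_sd(_mm_set_pd(upper_hi, upper_lo),
                              _mm_set_pd(value_hi, value_lo)));
}

/* One-operand shape: pass the same vector as both operands, so the
   upper lane is taken from `value` itself. */
static lanes sqrt_scalar(double value_lo, double value_hi) {
    __m128d v = _mm_set_pd(value_hi, value_lo);
    return unpack(_mm_sqrt_sd(v, v));
}
```

With inputs `a = (lo 1, hi 2)` and `b = (lo 10, hi 20)`, the packed add touches both lanes while the scalar add changes only the lower lane and passes `a`'s upper lane through. The proposed `AddScalar` and the two `SqrtScalar` overloads would presumably expose the same behavior.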
Other Thoughts
Most of the remaining sections (Intended Audience, Semantics and Usage, Implementation Roadmap, etc.) are the same as in #23057.