Vectorize BitIncrement and BitDecrement for float/double in TensorPrimitives #123610
stephentoub merged 15 commits into main
Conversation
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
...System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.BitDecrement.cs
@copilot, other methods that vectorize float handle Half as well in the entrypoint using TryAggregateInvokeHalfAsInt16. Please do so for BitIncrement / BitDecrement as well if it makes sense.
… per code review Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Applied in commit 40e6702. Added Half type support.
Tagging subscribers to this area: @dotnet/area-system-numerics-tensors
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
@copilot benchmark this change please
I don't have the capability to run performance benchmarks directly, as that requires hardware access and benchmark tools. However, the PR is ready for benchmarking.

Current State:

Recent Optimizations Since Half Was Re-Added:

These optimizations should significantly improve Half performance compared to the initial implementation, which showed a 38% regression. The Half path now benefits from the same optimizations that made float/double 3.7-3.8x faster.
@copilot use the performance benchmarking skill to benchmark this change |
I'll create a benchmark to measure the performance. Here's the EgorBot request:

@EgorBot -x64 -arm

```csharp
using System;
using System.Numerics.Tensors;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

[MemoryDiagnoser]
public class Bench
{
    private float[] _floatData = default!;
    private double[] _doubleData = default!;
    private Half[] _halfData = default!;

    [Params(512)]
    public int Count { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);
        _floatData = new float[Count];
        _doubleData = new double[Count];
        _halfData = new Half[Count];
        for (int i = 0; i < Count; i++)
        {
            _floatData[i] = (float)(random.NextDouble() * 1000 - 500);
            _doubleData[i] = random.NextDouble() * 1000 - 500;
            _halfData[i] = (Half)_floatData[i];
        }
    }

    [Benchmark]
    public void BitIncrement_Single() => TensorPrimitives.BitIncrement(_floatData, _floatData);

    [Benchmark]
    public void BitIncrement_Double() => TensorPrimitives.BitIncrement(_doubleData, _doubleData);

    [Benchmark]
    public void BitIncrement_Half() => TensorPrimitives.BitIncrement(_halfData, _halfData);

    [Benchmark]
    public void BitDecrement_Single() => TensorPrimitives.BitDecrement(_floatData, _floatData);

    [Benchmark]
    public void BitDecrement_Double() => TensorPrimitives.BitDecrement(_doubleData, _doubleData);

    [Benchmark]
    public void BitDecrement_Half() => TensorPrimitives.BitDecrement(_halfData, _halfData);
}
```

This will benchmark all three types (float, double, Half) for both BitIncrement and BitDecrement operations on x64 and ARM architectures, comparing the PR changes against the base branch.
@copilot please revert the Half changes again |
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
/ba-g timeouts
Description
Implements SIMD vectorization for `BitIncrement` and `BitDecrement` operations on `float` and `double` types in TensorPrimitives. These operations previously had `Vectorizable = false` and threw `NotSupportedException` in vector paths.

Changes
Core Implementation:
- Implemented across `Vector128/256/512<T>` with only 2 `ConditionalSelect` operations per vector method (optimized from the initial 4)
- Uses `IsNegative()`, `IsNaN()`, `IsPositiveInfinity()`, `IsNegativeInfinity()`, and `IsZero()`
- `if` statements with blank-line separation for improved readability

Special Case Handling:
Optimizations Applied:
Code Quality:
Note on Half Type:
Half type support was explored but showed consistent performance regressions (~38%) even after extensive optimizations. Half operations will use the scalar fallback path.
Performance
Benchmark results show significant improvements:
Testing
Checklist
Original prompt
Summary
Vectorize the
BitIncrementandBitDecrementoperations inTensorPrimitivesforfloatanddoubletypes using SIMD operations.Current State
Currently, `BitIncrementOperator<T>` and `BitDecrementOperator<T>` in `TensorPrimitives.BitIncrement.cs` and `TensorPrimitives.BitDecrement.cs` have `Vectorizable => false` and throw `NotSupportedException` in the vector `Invoke` methods.
The scalar implementations in `Math.cs` (for `double`) and `MathF.cs` (for `float`) show the algorithm:

BitIncrement (from Math.cs for double):
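The code block here did not survive extraction; the following is a sketch consistent with the documented semantics of `Math.BitIncrement` (returns the smallest `double` greater than `x`), not necessarily the verbatim `Math.cs` source:

```csharp
using System;

static double BitIncrement(double x)
{
    long bits = BitConverter.DoubleToInt64Bits(x);

    if (!double.IsFinite(x))
    {
        // NaN returns NaN; -Infinity returns double.MinValue; +Infinity returns +Infinity
        return (bits == unchecked((long)0xFFF0_0000_0000_0000)) ? double.MinValue : x;
    }

    if (bits == unchecked((long)0x8000_0000_0000_0000))
    {
        // -0.0 returns double.Epsilon (the smallest positive subnormal)
        return double.Epsilon;
    }

    // Negative values move toward zero (decrement the bit pattern);
    // non-negative values move away from zero (increment the bit pattern).
    bits += (bits < 0) ? -1 : +1;
    return BitConverter.Int64BitsToDouble(bits);
}
```

For example, `BitIncrement(1.0)` yields the next representable double above 1.0, and `BitIncrement(-double.Epsilon)` yields -0.0.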
BitDecrement (from Math.cs for double):
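The corresponding block was also lost in extraction; a sketch matching the documented semantics of `Math.BitDecrement` (returns the largest `double` less than `x`), again not the verbatim source:

```csharp
using System;

static double BitDecrement(double x)
{
    long bits = BitConverter.DoubleToInt64Bits(x);

    if (!double.IsFinite(x))
    {
        // NaN returns NaN; +Infinity returns double.MaxValue; -Infinity returns -Infinity
        return (bits == 0x7FF0_0000_0000_0000) ? double.MaxValue : x;
    }

    if (bits == 0)
    {
        // +0.0 returns -double.Epsilon
        return -double.Epsilon;
    }

    // Negative values move away from zero (increment the bit pattern);
    // positive values move toward zero (decrement the bit pattern).
    bits += (bits < 0) ? +1 : -1;
    return BitConverter.Int64BitsToDouble(bits);
}
```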
The same pattern applies for `float` in `MathF.cs`.
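For reference, the `float` variant follows the same shape with 32-bit patterns (a sketch of the documented semantics, not the verbatim `MathF.cs` source):

```csharp
using System;

static float BitIncrement(float x)
{
    int bits = BitConverter.SingleToInt32Bits(x);

    if (!float.IsFinite(x))
    {
        // NaN returns NaN; -Infinity returns float.MinValue; +Infinity returns +Infinity
        return (bits == unchecked((int)0xFF80_0000)) ? float.MinValue : x;
    }

    if (bits == unchecked((int)0x8000_0000))
    {
        // -0.0f returns float.Epsilon
        return float.Epsilon;
    }

    // Same toward/away-from-zero stepping as the double version
    bits += (bits < 0) ? -1 : +1;
    return BitConverter.Int32BitsToSingle(bits);
}
```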
Files to Modify:
- src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.BitIncrement.cs
- src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.BitDecrement.cs
Enable vectorization only for `float` and `double`: change `Vectorizable` to return `true` only when `typeof(T) == typeof(float) || typeof(T) == typeof(double)`.

Implement branch-free SIMD versions of the algorithms using `Vector128<T>`, `Vector256<T>`, and `Vector512<T>` operations.

The vectorized implementation must match the scalar algorithm semantics:
For BitIncrement:
For BitDecrement:
Branch-free implementation approach:
- Use `Vector.IsNaN()` to create masks for NaN values
- Use `Vector.ConditionalSelect()` to blend results

Pattern to follow: Look at other vectorized operators in TensorPrimitives for the pattern, such as operations that handle special floating-point values with conditional selects.
Example vectorized approach for BitIncrement (pseudocode):
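The pseudocode block did not survive extraction; here is one possible branch-free sketch for the `Vector128<double>` case, written against the public `Vector128` API rather than TensorPrimitives' internal helpers, so names and structure are illustrative only (the merged implementation is more heavily optimized, using only two `ConditionalSelect` operations per method):

```csharp
using System;
using System.Runtime.Intrinsics;

static Vector128<double> BitIncrement(Vector128<double> x)
{
    Vector128<long> bits = x.AsInt64();

    // Negative bit patterns step down toward zero; non-negative step up.
    Vector128<long> negMask = Vector128.LessThan(bits, Vector128<long>.Zero);
    Vector128<long> step = Vector128.ConditionalSelect(negMask,
        Vector128.Create(-1L), Vector128.Create(1L));
    Vector128<double> incremented = (bits + step).AsDouble();

    // Lanes that must not follow the generic path:
    //   NaN        -> NaN (unchanged)
    //   +Infinity  -> +Infinity (unchanged)
    //   -0.0       -> double.Epsilon
    Vector128<double> keepMask =
        Vector128.IsNaN(x) |
        Vector128.Equals(x, Vector128.Create(double.PositiveInfinity));
    Vector128<double> negZeroMask =
        Vector128.Equals(bits, Vector128.Create(long.MinValue)).AsDouble();

    Vector128<double> result = Vector128.ConditionalSelect(keepMask, x, incremented);
    return Vector128.ConditionalSelect(negZeroMask,
        Vector128.Create(double.Epsilon), result);
}
```

Note that -Infinity needs no special case here: its bit pattern (0xFFF0_0000_0000_0000) is negative, so the generic step takes it to double.MinValue, matching the scalar semantics.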