JIT: Improve codegen for xarch vector byte multiply #126348
saucecontrol wants to merge 6 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
Improves xarch JIT codegen for SIMD byte multiplication by using a more efficient two-multiply “odd/even byte” strategy when widening to the next vector size isn’t possible, reducing unnecessary widen/narrow work compared to the prior fallback.
Changes:
- Adds a fast path that widens to the next vector size (AVX2 for SIMD16, AVX512 for SIMD32) to perform a single multiply and then narrow.
- Replaces the previous fallback (split/widen/mul/narrow twice) with an odd/even byte approach that uses two 16-bit multiplies and recombines the bytes with masks and shifts (see the sketch below).
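For illustration, here is a minimal cross-platform sketch of the odd/even strategy using the public `Vector128` APIs. The helper name and the exact mask/shift sequence are mine for clarity; the JIT's internal expansion may differ in detail:

```csharp
using System.Runtime.Intrinsics;

static class OddEvenSketch
{
    static Vector128<byte> MultiplyBytes(Vector128<byte> a, Vector128<byte> b)
    {
        Vector128<ushort> x = a.AsUInt16();
        Vector128<ushort> y = b.AsUInt16();

        // Even-indexed bytes occupy the low byte of each 16-bit lane; a
        // 16-bit multiply leaves their products there, and the mask drops
        // the overflow into the high byte.
        Vector128<ushort> even = (x * y) & Vector128.Create((ushort)0x00FF);

        // Odd-indexed bytes are shifted down into the low byte, multiplied,
        // then shifted back up; the left shift also drops their overflow.
        Vector128<ushort> odd = Vector128.ShiftLeft(
            Vector128.ShiftRightLogical(x, 8) * Vector128.ShiftRightLogical(y, 8), 8);

        return (even | odd).AsByte();
    }
}
```

This maps to two `pmullw`s plus cheap shifts and masks, instead of the widen/multiply/narrow round-trip done twice by the old fallback.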
The current codegen has an invalid operand for the instruction? Should be […], and there's a similar issue for […].
Ha, I didn't notice that. Looks like that's just a bug in the JIT disasm. Running the code bytes through another disassembler shows it correctly.
cc @dotnet/jit-contrib
@EgorBot -amd -intel

using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.ConsoleArguments;
using BenchmarkDotNet.Filters;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Loggers;
using BenchmarkDotNet.Running;
if (args.Contains("--list"))
{
foreach (var method in typeof(ByteMul).GetMethods().Where(m => m.IsDefined(typeof(BenchmarkAttribute), true)))
Console.WriteLine($"{method.DeclaringType!.Name}.{method.Name}");
return;
}
var (_, config, _) = ConfigParser.Parse(args, ConsoleLogger.Default);
BenchmarkRunner.Run<ByteMul>(BdnConfig.AddJobs(config));
[HideColumns(Column.EnvironmentVariables)]
public unsafe class ByteMul
{
private const int len = 1 << 16;
private byte* data;
[GlobalSetup]
public void Setup()
{
data = (byte*)NativeMemory.AlignedAlloc(len, 64);
Random.Shared.NextBytes(new Span<byte>(data, len));
}
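// Each benchmark multiplies adjacent vectors from the buffer and XOR-folds
// the products so the multiplies can't be optimized away.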
[Benchmark]
public Vector128<byte> MultiplyV128()
{
byte* ptr = data, end = ptr + len - Vector128<byte>.Count;
var res = Vector128<byte>.Zero;
while (ptr < end)
{
res ^= Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<byte>.Count);
ptr += Vector128<byte>.Count;
}
return res;
}
[Benchmark]
public Vector256<byte> MultiplyV256()
{
byte* ptr = data, end = ptr + len - Vector256<byte>.Count;
var res = Vector256<byte>.Zero;
while (ptr < end)
{
res ^= Vector256.LoadAligned(ptr) * Vector256.LoadAligned(ptr + Vector256<byte>.Count);
ptr += Vector256<byte>.Count;
}
return res;
}
[Benchmark]
public Vector512<byte> MultiplyV512()
{
byte* ptr = data, end = ptr + len - Vector512<byte>.Count;
var res = Vector512<byte>.Zero;
while (ptr < end)
{
res ^= Vector512.LoadAligned(ptr) * Vector512.LoadAligned(ptr + Vector512<byte>.Count);
ptr += Vector512<byte>.Count;
}
return res;
}
}
public static class BdnConfig
{
public static IConfig AddJobs(IConfig config)
{
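// Run each parsed job three ways: as-is (AVX-512), with AVX-512 disabled,
// and with AVX disabled, so all three codegen paths get measured.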
const string defaultJob = "AVX-512";
const string noAvx512Job = "AVX2";
const string noAvxJob = "SSE4.2";
var newcfg = ManualConfig.Create(DefaultConfig.Instance);
foreach (var job in config.GetJobs())
{
newcfg
.AddJob(job.UnfreezeCopy().WithId(defaultJob))
.AddJob(job.WithEnvironmentVariable(new EnvironmentVariable("DOTNET_EnableAVX512", "0")).WithId(noAvx512Job).WithBaseline(false))
.AddJob(job.WithEnvironmentVariable(new EnvironmentVariable("DOTNET_EnableAVX", "0")).WithId(noAvxJob).WithBaseline(false));
}
return newcfg.AddFilter(new SimpleFilter(benchmarkCase => {
bool skipAvx = benchmarkCase.Job.Id.StartsWith(noAvxJob);
bool skipAvx512 = skipAvx || benchmarkCase.Job.Id.StartsWith(noAvx512Job);
var methodName = benchmarkCase.Descriptor.WorkloadMethod.Name;
bool isV256 = methodName.Contains("V256");
bool isV512 = methodName.Contains("V512");
return (!isV256 || !skipAvx) && (!isV512 || !skipAvx512);
}));
}
}
Done
Resolves #109775
In cases where we can widen to the next vector size up and multiply once, the current codegen is already good. When that's not possible, the current codegen falls back to a version that splits into two vectors and runs the same basic algorithm, which is not optimal.
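For context, a rough sketch of what that widening fast path looks like for the 16-byte case when AVX2 is available (the helper name and the exact narrowing step here are illustrative, not the JIT's literal expansion):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class WidenSketch
{
    static Vector128<byte> MultiplyBytesWidened(Vector128<byte> a, Vector128<byte> b)
    {
        // Zero-extend each byte to 16 bits in a 256-bit vector (vpmovzxbw).
        Vector256<ushort> wa = Avx2.ConvertToVector256Int16(a).AsUInt16();
        Vector256<ushort> wb = Avx2.ConvertToVector256Int16(b).AsUInt16();

        // One 16-bit multiply computes all 16 byte products (vpmullw).
        Vector256<ushort> product = wa * wb;

        // Truncate each 16-bit product back to a byte and take the low half.
        return Vector256.Narrow(product, product).GetLower();
    }
}
```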
This implements the suggestion made by @MineCake147E on #109775, which still requires two multiplications but avoids the double widening and narrowing. The result is a ~2x perf improvement.
Benchmarks
Typical diff:
Full diffs
NB: codegen could actually be better, but currently the JIT imports (and morphs) `AND_NOT(x, y)` as `AND(x, NOT(y))`, which, in the case of a constant `y` that is re-used, creates two constants where one would have sufficed.
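A source-level analogy of that issue (illustrative only, not the JIT IR):

```csharp
using System.Runtime.Intrinsics;

static class AndNotSketch
{
    static Vector128<ushort> Example(Vector128<ushort> v)
    {
        Vector128<ushort> mask = Vector128.Create((ushort)0x00FF);

        // AND_NOT(v, mask) is imported as AND(v, NOT(mask)); since mask is
        // constant, NOT(mask) folds into a second constant (0xFF00), so both
        // 0x00FF and 0xFF00 get materialized even though one constant plus a
        // real and-not instruction would have sufficed.
        Vector128<ushort> low  = v & mask;                  // uses 0x00FF
        Vector128<ushort> high = Vector128.AndNot(v, mask); // becomes v & 0xFF00
        return low | high;
    }
}
```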