JIT: Improve codegen for xarch vector byte multiply #126348
saucecontrol wants to merge 6 commits into dotnet:main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
Improves xarch JIT codegen for SIMD byte multiplication by using a more efficient two-multiply “odd/even byte” strategy when widening to the next vector size isn’t possible, reducing unnecessary widen/narrow work compared to the prior fallback.
Changes:
- Adds a fast path that widens to the next vector size (AVX2 for SIMD16, AVX512 for SIMD32) to perform a single multiply and then narrow.
- Replaces the previous fallback (split/widen/mul/narrow twice) with an odd/even byte approach that uses two 16-bit multiplies and recombines the bytes with masks and shifts (see the sketch below).
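For illustration, here is a minimal cross-platform sketch of the odd/even strategy using the public `Vector128` APIs. The helper name and the exact mask/shift sequence are mine for clarity; the JIT's internal expansion may differ in detail:

```csharp
using System.Runtime.Intrinsics;

static class OddEvenSketch
{
    static Vector128<byte> MultiplyBytes(Vector128<byte> a, Vector128<byte> b)
    {
        Vector128<ushort> x = a.AsUInt16();
        Vector128<ushort> y = b.AsUInt16();

        // Even-indexed bytes occupy the low byte of each 16-bit lane; a
        // 16-bit multiply leaves their products there, and the mask drops
        // the overflow into the high byte.
        Vector128<ushort> even = (x * y) & Vector128.Create((ushort)0x00FF);

        // Odd-indexed bytes are shifted down into the low byte, multiplied,
        // then shifted back up; the left shift also drops their overflow.
        Vector128<ushort> odd = Vector128.ShiftLeft(
            Vector128.ShiftRightLogical(x, 8) * Vector128.ShiftRightLogical(y, 8), 8);

        return (even | odd).AsByte();
    }
}
```

This maps to two `pmullw`s plus cheap shifts and masks, instead of the widen/multiply/narrow round-trip done twice by the old fallback.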
The current codegen has an invalid operand for the instruction? Should be […], and there's a similar issue for […].
Ha, I didn't notice that. Looks like that's just a bug in the JIT disasm. Running the code bytes through another disassembler shows it correctly.
cc @dotnet/jit-contrib
@EgorBot -amd -intel

using System;
using System.Linq;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Columns;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.ConsoleArguments;
using BenchmarkDotNet.Filters;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Loggers;
using BenchmarkDotNet.Running;
if (args.Contains("--list"))
{
foreach (var method in typeof(ByteMul).GetMethods().Where(m => m.IsDefined(typeof(BenchmarkAttribute), true)))
Console.WriteLine($"{method.DeclaringType!.Name}.{method.Name}");
return;
}
var (_, config, _) = ConfigParser.Parse(args, ConsoleLogger.Default);
BenchmarkRunner.Run<ByteMul>(BdnConfig.AddJobs(config));
[HideColumns(Column.EnvironmentVariables)]
public unsafe class ByteMul
{
private const int len = 1 << 16;
private byte* data;
[GlobalSetup]
public void Setup()
{
data = (byte*)NativeMemory.AlignedAlloc(len, 64);
Random.Shared.NextBytes(new Span<byte>(data, len));
}
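// Each benchmark multiplies adjacent vectors from the buffer and XOR-folds
// the products so the multiplies can't be optimized away.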
[Benchmark]
public Vector128<byte> MultiplyV128()
{
byte* ptr = data, end = ptr + len - Vector128<byte>.Count;
var res = Vector128<byte>.Zero;
while (ptr < end)
{
res ^= Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<byte>.Count);
ptr += Vector128<byte>.Count;
}
return res;
}
[Benchmark]
public Vector256<byte> MultiplyV256()
{
byte* ptr = data, end = ptr + len - Vector256<byte>.Count;
var res = Vector256<byte>.Zero;
while (ptr < end)
{
res ^= Vector256.LoadAligned(ptr) * Vector256.LoadAligned(ptr + Vector256<byte>.Count);
ptr += Vector256<byte>.Count;
}
return res;
}
[Benchmark]
public Vector512<byte> MultiplyV512()
{
byte* ptr = data, end = ptr + len - Vector512<byte>.Count;
var res = Vector512<byte>.Zero;
while (ptr < end)
{
res ^= Vector512.LoadAligned(ptr) * Vector512.LoadAligned(ptr + Vector512<byte>.Count);
ptr += Vector512<byte>.Count;
}
return res;
}
}
public static class BdnConfig
{
public static IConfig AddJobs(IConfig config)
{
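// Run each parsed job three ways: as-is (AVX-512), with AVX-512 disabled,
// and with AVX disabled, so all three codegen paths get measured.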
const string defaultJob = "AVX-512";
const string noAvx512Job = "AVX2";
const string noAvxJob = "SSE4.2";
var newcfg = ManualConfig.Create(DefaultConfig.Instance);
foreach (var job in config.GetJobs())
{
newcfg
.AddJob(job.UnfreezeCopy().WithId(defaultJob))
.AddJob(job.WithEnvironmentVariable(new EnvironmentVariable("DOTNET_EnableAVX512", "0")).WithId(noAvx512Job).WithBaseline(false))
.AddJob(job.WithEnvironmentVariable(new EnvironmentVariable("DOTNET_EnableAVX", "0")).WithId(noAvxJob).WithBaseline(false));
}
return newcfg.AddFilter(new SimpleFilter(benchmarkCase => {
bool skipAvx = benchmarkCase.Job.Id.StartsWith(noAvxJob);
bool skipAvx512 = skipAvx || benchmarkCase.Job.Id.StartsWith(noAvx512Job);
var methodName = benchmarkCase.Descriptor.WorkloadMethod.Name;
bool isV256 = methodName.Contains("V256");
bool isV512 = methodName.Contains("V512");
return (!isV256 || !skipAvx) && (!isV512 || !skipAvx512);
}));
}
}
Done
Resolves #109775
In cases where we can widen to the next vector size up and multiply once, the current codegen is already good. When that's not possible, the current codegen falls back to a version that splits into two vectors and runs the same basic algorithm, which is not optimal.
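For context, a rough sketch of what that widening fast path looks like for the 16-byte case when AVX2 is available (the helper name and the exact narrowing step here are illustrative, not the JIT's literal expansion):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class WidenSketch
{
    static Vector128<byte> MultiplyBytesWidened(Vector128<byte> a, Vector128<byte> b)
    {
        // Zero-extend each byte to 16 bits in a 256-bit vector (vpmovzxbw).
        Vector256<ushort> wa = Avx2.ConvertToVector256Int16(a).AsUInt16();
        Vector256<ushort> wb = Avx2.ConvertToVector256Int16(b).AsUInt16();

        // One 16-bit multiply computes all 16 byte products (vpmullw).
        Vector256<ushort> product = wa * wb;

        // Truncate each 16-bit product back to a byte and take the low half.
        return Vector256.Narrow(product, product).GetLower();
    }
}
```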
This implements the suggestion made by @MineCake147E on #109775, which still requires two multiplications but avoids the double widening and narrowing. The result is a ~2x perf improvement.
Benchmarks
Typical diff:
Full diffs
NB: codegen could actually be better, but currently the JIT imports (and morphs) `AND_NOT(x, y)` as `AND(x, NOT(y))`, which, in the case of a constant `y` that is re-used, creates two constants where one would have sufficed.
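A source-level analogy of that issue (illustrative only, not the JIT IR):

```csharp
using System.Runtime.Intrinsics;

static class AndNotSketch
{
    static Vector128<ushort> Example(Vector128<ushort> v)
    {
        Vector128<ushort> mask = Vector128.Create((ushort)0x00FF);

        // AND_NOT(v, mask) is imported as AND(v, NOT(mask)); since mask is
        // constant, NOT(mask) folds into a second constant (0xFF00), so both
        // 0x00FF and 0xFF00 get materialized even though one constant plus a
        // real and-not instruction would have sufficed.
        Vector128<ushort> low  = v & mask;                  // uses 0x00FF
        Vector128<ushort> high = Vector128.AndNot(v, mask); // becomes v & 0xFF00
        return low | high;
    }
}
```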