
Extra zeroing with structs and inlining #8186

@RossNordby

I've run into some unnecessary initializations on struct locals with definitely assigned fields, plus some more when inlining is thrown into the mix. This is showing up in profiles of some inner loops: in my main test case, the initializations end up zeroing almost 100 megabytes every frame. It's not a huge slowdown (about 2.5%), but it would be nice to avoid.

A few test cases:

  1. Struct local within a function with NoInlining, called in a loop.
        // Requires: using System.Numerics; using System.Runtime.CompilerServices;
        struct StructType
        {
            public Vector<float> A;
            public Vector<float> B;
            public Vector<float> C;
            public Vector<float> D;
            public Vector<float> E;
            public Vector<float> F;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void DoSomeWorkWithAStruct(ref Vector<float> source, out Vector<float> result)
        {
            StructType u;
            u.A = new Vector<float>(2) * source;
            u.B = new Vector<float>(3) * source;
            u.C = new Vector<float>(4) * source;
            u.D = new Vector<float>(5) * source;
            u.E = new Vector<float>(6) * source;
            u.F = new Vector<float>(7) * source;
            result = u.A + u.B + u.C + u.D + u.E + u.F;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void TestStruct()
        {
            // Must be definitely assigned before it can be passed by ref.
            Vector<float> f = default(Vector<float>);
            for (int i = 0; i < 100; ++i)
            {
                DoSomeWorkWithAStruct(ref f, out f);
            }
        }

DoSomeWorkWithAStruct initializes the struct with a rep stos over 96 bytes. What happens if DoSomeWorkWithAStruct uses...

  2. AggressiveInlining.

There's now a 96 byte rep stos before the loop begins, but there's also another zeroing that occurs for every iteration:

xorpd       xmm1,xmm1  
movdqu      xmmword ptr [rdx],xmm1  
movdqu      xmmword ptr [rdx+10h],xmm1  
movdqu      xmmword ptr [rdx+20h],xmm1  
movdqu      xmmword ptr [rdx+30h],xmm1  
movdqu      xmmword ptr [rdx+40h],xmm1  
movdqu      xmmword ptr [rdx+50h],xmm1  

Oops. How about...

  3. Manual inlining.
        [MethodImpl(MethodImplOptions.NoInlining)]
        static void TestStructManuallyInlined()
        {
            // Must be definitely assigned before it is read in the loop body.
            Vector<float> f = default(Vector<float>);
            for (int i = 0; i < 100; ++i)
            {
                StructType u;
                u.A = new Vector<float>(2) * f;
                u.B = new Vector<float>(3) * f;
                u.C = new Vector<float>(4) * f;
                u.D = new Vector<float>(5) * f;
                u.E = new Vector<float>(6) * f;
                u.F = new Vector<float>(7) * f;
                f = u.A + u.B + u.C + u.D + u.E + u.F;
            }
        }

Still a rep stos outside the loop, but it's not a big deal since it gets amortized over all the iterations. No inner zeroing. Finally, compare to...

  4. No struct local, same number of variables, called with AggressiveInlining.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        static void DoSomeWorkStructless(ref Vector<float> source, out Vector<float> result)
        {
            var a = new Vector<float>(2) * source;
            var b = new Vector<float>(3) * source;
            var c = new Vector<float>(4) * source;
            var d = new Vector<float>(5) * source;
            var e = new Vector<float>(6) * source;
            var f = new Vector<float>(7) * source;
            result = d + e + f + a + b + c;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static void TestStructless()
        {
            // Must be definitely assigned before it can be passed by ref.
            Vector<float> f = default(Vector<float>);
            for (int i = 0; i < 100; ++i)
            {
                DoSomeWorkStructless(ref f, out f);
            }
        }

No zeroing!
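
For reference, the 96-byte figure in the cases above matches six `Vector<float>` fields at 16 bytes each, which is consistent with the `xmm` stores shown for case 2. Below is a minimal sketch for checking the expected footprint on a given machine. It assumes the struct has no padding beyond its six fields. `SizeCheck` and `ExpectedStructBytes` are hypothetical names used only for illustration.

```csharp
using System;
using System.Numerics;

static class SizeCheck
{
    // Hypothetical helper: bytes occupied by six Vector<float> fields.
    // Assumes no padding is added beyond the fields themselves.
    public static int ExpectedStructBytes()
    {
        const int fieldCount = 6;
        int lanes = Vector<float>.Count;      // hardware dependent
        return fieldCount * lanes * sizeof(float);
    }

    static void Main()
    {
        int lanes = Vector<float>.Count;
        int bytes = ExpectedStructBytes();
        Console.WriteLine("Vector<float>.Count = " + lanes);
        Console.WriteLine("Expected StructType size = " + bytes + " bytes");
    }
}
```

With 4 lanes per vector this works out to 96 bytes. On hardware where `Vector<float>` is wider, the zeroed region would presumably grow to match.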

While there are cases where applying a workaround in the form of option 3 or 4 is feasible, there are many cases where the extra complexity makes it impractical. In those cases, it would be useful to avoid the extra zeroing.

Tested on NETCore.App 2.0.0-preview2-25309-07. These and some other related test cases are available over here.
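
The comparison between the struct and structless variants only holds if both compute the same value. Below is a minimal sanity-check sketch, under the assumption that the two method bodies match the ones in the test cases above. `EquivalenceCheck`, `WithStruct`, `Structless`, and `VariantsAgree` are hypothetical names introduced for illustration, not part of the linked test cases. Since 2 + 3 + 4 + 5 + 6 + 7 = 27, every lane of the result should be 27 times the input lane.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

static class EquivalenceCheck
{
    struct StructType
    {
        public Vector<float> A, B, C, D, E, F;
    }

    // Mirrors the struct-local variant: every field is assigned before use.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void WithStruct(ref Vector<float> source, out Vector<float> result)
    {
        StructType u;
        u.A = new Vector<float>(2) * source;
        u.B = new Vector<float>(3) * source;
        u.C = new Vector<float>(4) * source;
        u.D = new Vector<float>(5) * source;
        u.E = new Vector<float>(6) * source;
        u.F = new Vector<float>(7) * source;
        result = u.A + u.B + u.C + u.D + u.E + u.F;
    }

    // Mirrors the structless variant: same products, held in plain locals.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static void Structless(ref Vector<float> source, out Vector<float> result)
    {
        var a = new Vector<float>(2) * source;
        var b = new Vector<float>(3) * source;
        var c = new Vector<float>(4) * source;
        var d = new Vector<float>(5) * source;
        var e = new Vector<float>(6) * source;
        var f = new Vector<float>(7) * source;
        result = d + e + f + a + b + c;
    }

    // Returns true when both variants produce 27 in every lane for an all-ones input.
    public static bool VariantsAgree()
    {
        Vector<float> input = Vector<float>.One;
        Vector<float> viaStruct;
        Vector<float> direct;
        WithStruct(ref input, out viaStruct);
        Structless(ref input, out direct);
        return viaStruct == direct && viaStruct == new Vector<float>(27);
    }

    static void Main()
    {
        Console.WriteLine(VariantsAgree());
    }
}
```

The check uses small integer-valued floats so the different summation order in the two variants cannot introduce rounding differences.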

(By the way, jumping from the previous desktop version up to the latest daily builds improved performance 30-40% in many cases, and up to 52% in some simulations. Awesome work!)
category:cq
theme:structs
skill-level:expert
cost:medium

Labels: area-CodeGen-coreclr, enhancement, optimization, tenet-performance