Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Improve inling in ImmutableArray<T>.Builder#28177

Merged
stephentoub merged 1 commit into
dotnet:masterfrom
svick:immutablearray-builder-inlining
Mar 18, 2018
Merged

Improve inling in ImmutableArray<T>.Builder#28177
stephentoub merged 1 commit into
dotnet:masterfrom
svick:immutablearray-builder-inlining

Conversation

@svick
Copy link
Copy Markdown
Member

@svick svick commented Mar 18, 2018

The issue https://github.com/dotnet/corefx/issues/28064 is about a benchmark whose performance is so bad that a new dangerous method was considered to improve that situation. But almost the same effect can be achieved just by ensuring that the Add() method and the indexer setter on ImmutableArray<T>.Builder can be inlined (see https://github.com/dotnet/corefx/issues/28064#issuecomment-373950250 for more details). This PR does that.

I have only verified that extracting the throw is useful for the indexer setter. But the indexer getter and ItemRef are very similar, so I assumed it makes sense for them too.

Performance results using BenchmarkDotNet (source):

Before:

Method Mean Error StdDev Median
Add 11.053 us 0.4281 us 1.2623 us 11.004 us
Indexer 6.246 us 0.1827 us 0.5242 us 6.063 us

After:

Method Mean Error StdDev Median
Add 5.538 us 0.1310 us 0.3863 us 5.422 us
Indexer 3.159 us 0.0631 us 0.1138 us 3.188 us

Relevant portions of JIT dumps:

Before:

*************** In fgFindBasicBlocks() for Builder[Int64][System.Int64]:Add(long):this
weight= 10 : state   3 [ ldarg.0 ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 79 : state  40 [ call ]
weight= 28 : state  24 [ ldc.i4.1 ]
weight=-12 : state  76 [ add ]
weight= 79 : state  40 [ call ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 20 : state 199 [ stloc.0 -> ldloc.0 ]
weight= 28 : state  24 [ ldc.i4.1 ]
weight=-12 : state  76 [ add ]
weight= 31 : state 111 [ stfld ]
weight= 12 : state   7 [ ldloc.0 ]
weight= 16 : state   4 [ ldarg.1 ]
weight= 65 : state 141 [ stelem ]
weight= 19 : state  42 [ ret ]

Inline candidate callsite is in a loop.  Multiplier increased to 3.
calleeNativeSizeEstimate=445
callsiteNativeSizeEstimate=115
benefit multiplier=3
threshold=345
Native estimate for function size exceeds threshold for inlining 44.5 > 34.5 (multiplier = 3)


Inline expansion aborted, inline not profitable
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline' for 'Bench:Add():this' calling 'Builder[Int64][System.Int64]:Add(long):this'
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline'

…

INLINER impTokenLookupContextHandle for Builder[Int64][System.Int64]:set_Item(int,long):this is 0x00007FF8B1B089D1.
*************** In fgFindBasicBlocks() for Builder[Int64][System.Int64]:set_Item(int,long):this
weight= 16 : state   4 [ ldarg.1 ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 79 : state  40 [ call ]
weight= 28 : state  50 [ blt.s ]
weight=227 : state 103 [ newobj ]
weight=210 : state 108 [ throw ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 16 : state   4 [ ldarg.1 ]
weight= 35 : state   5 [ ldarg.2 ]
weight= 65 : state 141 [ stelem ]
weight= 19 : state  42 [ ret ]

Inline candidate callsite is in a loop.  Multiplier increased to 3.
calleeNativeSizeEstimate=736
callsiteNativeSizeEstimate=145
benefit multiplier=3
threshold=435
Native estimate for function size exceeds threshold for inlining 73.6 > 43.5 (multiplier = 3)


Inline expansion aborted, inline not profitable
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline' for 'Bench:Indexer():this' calling 'Builder[Int64][System.Int64]:set_Item(int,long):this'
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline'

After:

INLINER impTokenLookupContextHandle for Builder[Int64][System.Int64]:Add(long):this is 0x00007FF8B8698A39.
*************** In fgFindBasicBlocks() for Builder[Int64][System.Int64]:Add(long):this
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 28 : state  24 [ ldc.i4.1 ]
weight=-12 : state  76 [ add ]
weight=  6 : state  11 [ stloc.0 ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 12 : state   7 [ ldloc.0 ]
weight= 79 : state  40 [ call ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 16 : state   4 [ ldarg.1 ]
weight= 65 : state 141 [ stelem ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 12 : state   7 [ ldloc.0 ]
weight= 31 : state 111 [ stfld ]
weight= 19 : state  42 [ ret ]

Inline candidate is mostly loads and stores.  Multiplier increased to 3.
Inline candidate callsite is in a loop.  Multiplier increased to 6.
calleeNativeSizeEstimate=369
callsiteNativeSizeEstimate=115
benefit multiplier=6
threshold=690
Native estimate for function size is within threshold for inlining 36.9 <= 69 (multiplier = 6)
…
Successfully inlined Builder[Int64][System.Int64]:Add(long):this (42 IL bytes) (depth 1) [profitable inline]
--------------------------------------------------------------------------------------------

INLINER: during 'fgInline' result 'success' reason 'profitable inline' for 'Bench:Add():this' calling 'Builder[Int64][System.Int64]:Add(long):this'
INLINER: during 'fgInline' result 'success' reason 'profitable inline'

…

INLINER impTokenLookupContextHandle for Builder[Int64][System.Int64]:set_Item(int,long):this is 0x00007FF8B8698A39.
*************** In fgFindBasicBlocks() for Builder[Int64][System.Int64]:set_Item(int,long):this
weight= 16 : state   4 [ ldarg.1 ]
weight= 10 : state   3 [ ldarg.0 ]
weight= 79 : state  40 [ call ]
weight= 28 : state  50 [ blt.s ]
weight= 79 : state  40 [ call ]
weight= 31 : state 191 [ ldarg.0 -> ldfld ]
weight= 16 : state   4 [ ldarg.1 ]
weight= 35 : state   5 [ ldarg.2 ]
weight= 65 : state 141 [ stelem ]
weight= 19 : state  42 [ ret ]

Inline candidate callsite is in a loop.  Multiplier increased to 3.
calleeNativeSizeEstimate=378
callsiteNativeSizeEstimate=145
benefit multiplier=3
threshold=435
Native estimate for function size is within threshold for inlining 37.8 <= 43.5 (multiplier = 3)
…
Successfully inlined Builder[Int64][System.Int64]:set_Item(int,long):this (28 IL bytes) (depth 1) [profitable inline]
--------------------------------------------------------------------------------------------

INLINER: during 'fgInline' result 'success' reason 'profitable inline' for 'Bench:Indexer():this' calling 'Builder[Int64][System.Int64]:set_Item(int,long):this'
INLINER: during 'fgInline' result 'success' reason 'profitable inline'

Copy link
Copy Markdown
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks.

@stephentoub
Copy link
Copy Markdown
Member

cc: @AArnott, @jaredpar

@svick
Copy link
Copy Markdown
Member Author

svick commented Mar 18, 2018

@dotnet-bot test Linux x64 Release Build please
@dotnet-bot test OSX x64 Debug Build please

{
this.EnsureCapacity(this.Count + 1);
_elements[_count++] = item;
int newCount = _count + 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can optimize Add even further when you split it in a fast- and cold-path. Like

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Add(T item)
{
	int count = _count;
	T[] elements = _elements;

	if ((uint)count < (uint)elements.Length)
	{
		elements[count] = item;
		_count = count + 1;
	}
	else
	{
		AddWithResize(item);
	}
}

// Improve code quality as uncommon path
[MethodImpl(MethodImplOptions.NoInlining)]
private void AddWithResize(T item)
{
	int newCount = _count + 1;
	this.EnsureCapacity(newCount);
	_elements[_count] = item;
	_count = newCount;
}

Note that Add has to be attributed with AggressiveInlining, because -- interestingly though less asm-size -- the JIT won't inline otherwise.
This also safes the bounds-check in the fast-path.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfoidl That's a significant increase in complexity. Do you have something that shows that it's actually sufficiently faster to warrant that?

I made a quick benchmark and it seems your code is actually slower (SimpleAdd is the original code, TweakedAdd is code with my change, SplitAdd is your change):

Method Mean Error StdDev
SimpleAdd 7.935 us 0.0070 us 0.0055 us
TweakedAdd 3.274 us 0.0547 us 0.0512 us
SplitAdd 3.419 us 0.0322 us 0.0302 us

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have something that shows that it's actually sufficiently faster to warrant that?

List.Add, Stack.Push and similar classes use this pattern and there it is faster. My suggestion is based on that experience.

When I run your benchmark the results look a bit different.

Linux:

BenchmarkDotNet=v0.10.11, OS=ubuntu 16.04
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=4
.NET Core SDK=2.1.300-preview1-008174
  [Host]     : .NET Core 2.1.0-preview1-26216-03 (Framework 4.6.26216.04), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview1-26216-03 (Framework 4.6.26216.04), 64bit RyuJIT

Method Mean Error StdDev Median
SimpleAdd 4.823 us 0.0891 us 0.0790 us 4.816 us
TweakedAdd 2.547 us 0.0521 us 0.1318 us 2.505 us
SplitAdd 2.232 us 0.0434 us 0.0426 us 2.228 us

Windows:

BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Processor=Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), ProcessorCount=8
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
.NET Core SDK=2.1.300-preview1-008174
  [Host]     : .NET Core 2.1.0-preview1-26216-03 (Framework 4.6.26216.04), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview1-26216-03 (Framework 4.6.26216.04), 64bit RyuJIT

Method Mean Error StdDev
SimpleAdd 4.192 us 0.0240 us 0.0213 us
TweakedAdd 2.178 us 0.0203 us 0.0190 us
SplitAdd 1.950 us 0.0140 us 0.0124 us

Can you please re-check with your benchmark?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfoidl I'm still getting SplitAdd as slower on my machine:

BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.309)
Processor=Intel Core i5-2300 CPU 2.80GHz (Sandy Bridge), ProcessorCount=4
Frequency=2727538 Hz, Resolution=366.6310 ns, Timer=TSC
.NET Core SDK=2.1.300-preview1-008174
  [Host]     : .NET Core ? (Framework 4.6.26216.04), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview1-26216-03 (Framework 4.6.26216.04), 64bit RyuJIT

Method Mean Error StdDev Median
SimpleAdd 8.267 us 0.1789 us 0.3888 us 8.103 us
TweakedAdd 3.358 us 0.0667 us 0.0935 us 3.381 us
SplitAdd 3.807 us 0.0816 us 0.2354 us 3.793 us

It could be caused by the different CPU, or maybe something else. Though even with your numbers, it's only 10 % improvement, so I'm not sure it's worth that much additional complexity.

Because of that, I'm not going to change my PR to include your changes. Instead, you could open your own PR.

@svick
Copy link
Copy Markdown
Member Author

svick commented Mar 18, 2018

@dotnet-bot test OSX x64 Debug Build please

Copy link
Copy Markdown
Member

@karelz karelz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@karelz
Copy link
Copy Markdown
Member

karelz commented Mar 18, 2018

Do we still need the original issue? Is it still tracking a meaningful improvement over your latest change?

cc @valenis @brianrob

@karelz karelz added the tenet-performance Performance related issue label Mar 18, 2018
@stephentoub stephentoub merged commit b5c1cee into dotnet:master Mar 18, 2018
@svick svick deleted the immutablearray-builder-inlining branch March 18, 2018 20:26
@svick
Copy link
Copy Markdown
Member Author

svick commented Mar 18, 2018

@karelz Like I said in my comment on the issue (https://github.com/dotnet/corefx/issues/28064#issuecomment-373950250), my opinion is that with this change, the proposed UnsafeFreeze API is not necessary, because the performance improvement would be relatively small.

But I don't know if that means the issue should be closed.

@jaredpar
Copy link
Copy Markdown
Member

CC @VSadov

@karelz karelz added this to the 2.1.0 milestone Mar 27, 2018
ericstj pushed a commit to ericstj/corefx that referenced this pull request Mar 28, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants