Add LargeArrayBuilder type to encapsulate ToArray logic #13076

stephentoub merged 8 commits into dotnet:master from
Conversation
I found out this wasn't getting inlined, so I tagged it with AggressiveInlining. However, it turns out that doing so generates ~50 more bytes of code for each call site to Add, and the effect is amplified since we're using the builder in a generic context.
So my solution was to add a SlowAdd method wrapper around this that was tagged with NoInlining. Add is used in places where it is part of the bottleneck (such as in loops), otherwise SlowAdd is used to minimize code bloat.
I found out this wasn't getting inlined, so I tagged it with AggressiveInlining
Why was it important that it get inlined? Do you have a benchmark that shows a measurable improvement in key scenarios from that happening? Otherwise I'd rather let the JIT do its thing and not complicate the API here with two Add methods that are identical other than in inlining.
cc: @AndyAyersMS
@stephentoub, will post when I get a chance.
Yes, removing the AggressiveInlining indeed does result in a perf regression. I wrote this app to measure the throughput with/without the AggressiveInlining; removing the attribute results in a significant (~30%) slowdown.
removing the attribute results in a significant (~30%) slowdown
You're comparing release builds, and the only difference between them is whether Add has AggressiveInlining on it? I just tried with your PR (with and without that one line with AggressiveInlining), and while I do see an improvement that comes from the PR in general, I don't see any such difference between whether Add has AggressiveInlining on it.
Yes. What test app are you using to benchmark?
Same as you. I've tried a few other things, though, and I am able to construct a simple console app that sees about a ~15% improvement from the AggressiveInlining, so while not 30%, it does seem worthwhile to keep it for now.
@stephentoub, good. My percentages may have been skewed a bit since I was comparing the smaller times in relation to the larger times instead of vice versa (1100 is a ~22% regression from 900, while 900 is an ~18% improvement from 1100), so I will have to be careful about that in the future.
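The baseline issue can be illustrated with hypothetical timings of 900 and 1100 units (a quick Python sketch; the numbers are illustrative, not the actual benchmark results):

```python
# Hypothetical timings for the fast and slow builds.
fast, slow = 900.0, 1100.0

# Regression: how much slower the slow build is, relative to the FAST baseline.
regression = (slow - fast) / fast * 100    # ~22.2%

# Improvement: how much faster the fast build is, relative to the SLOW baseline.
improvement = (slow - fast) / slow * 100   # ~18.2%
```

The same 200-unit delta reads as a larger percentage when divided by the smaller baseline, which is why regression and improvement figures computed from the same data don't match.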
Sorry, @jamesqo just saw this now.... did you ever drill deeper into the codegen with and without forced inlining?
@AndyAyersMS Sorry, I forgot about your comment until now. I do have a gist from around the time I made this PR: https://gist.github.com/jamesqo/e46cdf95bd416220c48e16583fd91f6c

The method being jitted is AppendPrepend1Iterator.LazyToArray, which you can find in the diffs.
- NoInline.asm is when AggressiveInlining is not applied and the code looks like:

  ```csharp
  if (!_appending) { builder.Add(_item); }
  builder.AddRange(_source);
  if (_appending) { builder.Add(_item); }
  ```

- WithInline.asm is that same code sample, except when AggressiveInlining is applied to Add.
- WithInline2.asm is when the code is instead written like:

  ```csharp
  if (_appending) { builder.AddRange(_source); }
  builder.Add(_item);
  if (!_appending) { builder.AddRange(_source); }
  ```

That's all the investigation I did.
Ok, thanks. Not obvious from just this where the win came from. I'd need to go back and look at what Add compiles to when it's not inlined.
self-note: Remove this variable
@karelz, FYI I'd say this PR is more related to System.Linq than System.Collections. Would be nice if you could relabel / change assignees.

I see the code changes in both -- Collections and Linq. I will leave area owners/experts to decide who is in best position to do the code review.
TODO: It may be worth eliminating this and just writing the whole thing inline in AddRange(IEnumerable). The one place where it's used can be eliminated by adding some extra methods to SingleLinkedNode.
355474b to 6c76751
@stephentoub Do you think it would be possible for you to review this when you have time? Thanks.

@ianhays @Priya91 @VSadov @OmarTawfik ping? Who's the best person to review the change?
Why did you rename this from GenerateEnumerable to GenerateSequence? Seems like it was fine named what it was, maybe even better as Enumerable.
Would this not be better / simpler as:

```csharp
if (_appending)
{
    builder.AddRange(_source);
    builder.SlowAdd(_item);
}
else
{
    builder.SlowAdd(_item);
    builder.AddRange(_source);
}
```
@stephentoub, this is the same pattern that is used in other places throughout the file, e.g. here.
Ok. For consistency it's worth keeping it like this for now, but I think it'd also be worth following up to see whether it'd be worth changing all of them to what, at least to me, looks like a simpler and more understandable pattern. I don't know what the JIT'd code looks like, but it's possible there's also a perf benefit to one or the other, either in number of actual branches or in IL size or something like that.
I don't know what the JIT'd code looks like, but it's possible there's also a perf benefit to one or the other,
Note that _appending is readonly, so the JIT should be able to cache the field access in a register there and save some code size. I haven't checked, though.
Why not keep the cast as it was?

```csharp
return ((IEnumerable<TSource>)ToArray()).GetEnumerator();
```
I personally like AsEnumerable better, but sure.
Why separate this out into its own (public) method? Where else is it being used?
Why are you creating this intermediate array? Don't we know how long _appended is from the count stored in each node? Can't we just iterate through the list, adding each node's item to the array from the last to the first, as is done for _appended elsewhere?
@stephentoub We cannot presize the array here since _source is lazy. Since we can't presize the array, we don't know if there will be enough room to fit all of the appended items -- we might have to do another allocation in the builder.
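To illustrate why a lazy source prevents presizing, here is a small Python analogy (a sketch, not the actual C# implementation): a generator has no cheap length, so a builder has to grow its buffer as it consumes items:

```python
def lazy_source():
    # A lazy sequence: its length isn't known until it is fully enumerated.
    for i in range(10):
        yield i

buffer = [None] * 4      # start small; len(lazy_source()) isn't available
count = 0
for item in lazy_source():
    if count == len(buffer):
        # Out of room: allocate a larger buffer and copy -- this is the
        # extra allocation the builder may have to make.
        new_buffer = [None] * (len(buffer) * 2)
        new_buffer[:count] = buffer[:count]
        buffer = new_buffer
    buffer[count] = item
    count += 1

result = buffer[:count]  # final trim, analogous to ToArray
```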
Was such an array getting allocated before? I'm struggling to see this as a pure win if we're introducing additional intermediate ToArrays like this. If it was already there before, then ok, though it would be nice subsequently to find a way to avoid it.
@stephentoub Yes. The logic in this method is basically a copy verbatim of what MoveNext currently does.
Yes. The logic in this method is basically a copy verbatim of what MoveNext currently does.
Ugh. Ok. Thanks.
Why separate this out? The other cases in ToArray aren't in their own methods, seems like this should just be inline with the others, which then also removes the need for the assert.
@stephentoub, ToArray is generic so it can be regenerated by the JIT. The logic in LazyToArray seemed big enough that it seemed worth separating out into a new method, to avoid penalizing code that doesn't go down this branch. I'll post the generated code size of this method when I get time.
Just checked. LazyToArray comes in at an extra 306 bytes of code: https://gist.github.com/jamesqo/6b5f2cc27ca36fd3f488c7be57a87cb4

It would probably be wise to avoid generating all that unnecessary code if it's never called.
It would probably be wise to avoid generating all that unnecessary code if it's never called.
Why is that the case for "case -1" but not the case for "default" in ToArray? My point is that we should be consistent here.
@stephentoub, I separated the default: case into a new PreallocatingToArray method.
What does the comment "Is _first <= ResizeLimit." mean?
I meant to say, "If _count <= ResizeLimit then _first == _current." I'll rephrase that.
Instead of the : this(), how about just explicitly initializing the remaining fields (_buffers, _index, _count)?
@stephentoub, is explicit initialization faster than using the default struct constructor?
I just checked. If we initialize members explicitly then the constructor doesn't get inlined:

```
G_M36021_IG02:
       488D4DC8       lea      rcx, bword ptr [rbp-38H]
       BA01000000     mov      edx, 1
       E88CFAFFFF     call     LargeArrayBuilder`1:.ctor(bool):this
```

While this isn't the bottleneck, I think using : this() looks more readable anyway.
I think using : this() looks more readable anyway.
I personally disagree. To me this looks like we're initializing some of the fields multiple times, as delegating to "this()" needs to ensure that all of the fields are initialized, and then we subsequently overwrite some of them (it looks to me the same as having a bunch of fields in a class that are explicitly initialized to default(T) at the field declaration site and then subsequently overwritten in an explicit ctor). We also don't use this pattern elsewhere in corefx, which makes this stand out to me like a sore thumb.
If there's actually an inlining difference and it matters, that's something that should be addressed in the JIT.
cc: @AndyAyersMS
it looks to me the same as having a bunch of fields in a class that are explicitly initialized to default(T) at the field declaration site and then subsequently overwritten in an explicit ctor
If you think about it, that can be faster since the zeroing can be done in bulk vs. initializing fields 1 at a time. For example, on my machine the JIT generates code to use ymm registers:

```
C4E17957C0     vxorpd   ymm0, ymm0
C4E17A7F02     vmovdqu  qword ptr [rdx], ymm0
C4E17A7F4210   vmovdqu  qword ptr [rdx+16], ymm0
```

If there's actually an inlining difference and it matters, that's something that should be addressed in the JIT.
Considering things from the viewpoint of the JIT, isn't the constructor just a bunch of arbitrary code? So then when it looks at the constructor and sees that it has a huge body, it might not decide to inline even though all of the code is just default-initializing the struct. It would seem that if the JIT were to try to inline this, it would waste time for other cases where that arbitrary code isn't all just default-initialization.
I'm not really familiar with how the JIT works, but just a guess... cc @mikedn, @benaadams
If you think about it, that can be faster since the zeroing can be done in bulk vs. initializing fields 1 at a time
I'm not talking about what actually happens in the implementation; you were talking about readability, and part of readability is being able to reason about what's happening, so I was sharing what immediately comes to my mind when I see that code, regardless of what the compiler does, and hence why for me it harms readability, along with the fact that readability is helped by seeing common patterns and harmed by seeing jarring patterns, and to me, this use of ": this()" in a struct is jarring.
A compiler can choose to optimize certain patterns, so the compiler could choose to generate the exact same code for:

```csharp
public SomeStruct()
{
    _field1 = 0;
    _field2 = null;
    _field3 = Calculate();
}
```

as it does for:

```csharp
public SomeStruct() : this()
{
    _field3 = Calculate();
}
```
along with the fact that readability is helped by seeing common patterns and harmed by seeing jarring patterns, and to me, this use of ": this()" in a struct is jarring.
OK, I can remove the explicit : this() call then. 👍
A compiler can choose to optimize certain patterns, so the compiler could choose to generate the exact same code for:
Yes. I was trying to say that with the first version the JIT would (probably) have to inspect more bytecode to see that 0 was being assigned to _field1, and null was being assigned to _field2, before making that optimization (which would presumably come at a cost since all of this is done at runtime).
Why have AllocateBuffer return _current? We don't always use the result... why not just have the call sites that need _current access _current?
Not before we get an OOME.
Not before we get an OOME.
Won't it overflow when destination.Length == 0x40000000?
@stephentoub, good point. In that case, however, AllocateBuffer will attempt to allocate an array of size 0x40000000 * 2, which will raise an OverflowException anyway. Also, if we've reached that point then that would mean we've added 0x80000000 items, meaning ToArray couldn't possibly work. I don't know if it's worth adding an extra checked block which has no effect, and protects against a case which is already broken.
however, AllocateBuffer will attempt to allocate an array of size 0x40000000 * 2, which will raise an OverflowException anyways
Will it? I've not tried it, but it looks like it would actually see _count as negative, which is less than ResizeLimit, so it would enter the "we're still less than ResizeLimit" block. Maybe those checks need to be made robust.
Ah, very nice catch. It looks like we will multiply _count by 2 again, and 0x80000000 * 2 is 0, so then we allocate a 0-length array and set it to _first / _current instead of raising an exception. Since it doesn't add any additional cost, I can change the check to be if ((uint)_count < (uint)ResizeLimit) which should get us down the path which throws for overflow.
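The wraparound being discussed can be simulated in Python by masking to 32 bits (an illustration of the arithmetic only; the real ResizeLimit value differs and the names here are stand-ins):

```python
MASK = 0xFFFFFFFF  # 32 bits

def to_int32(x):
    """Reinterpret a value as a signed 32-bit integer, like a C# int."""
    x &= MASK
    return x - 0x100000000 if x >= 0x80000000 else x

RESIZE_LIMIT = 8  # stand-in value for illustration

count = to_int32(0x40000000 * 2)                # doubling 0x40000000 wraps negative
signed_check = count < RESIZE_LIMIT             # passes: negative count slips through
unsigned_check = (count & MASK) < RESIZE_LIMIT  # fails: the (uint) cast rejects it
doubled_again = (count * 2) & MASK              # wraps to 0: a zero-length buffer
```

The signed comparison lets the overflowed count back into the "still growing" path, while the unsigned reinterpretation routes it to the path that throws.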
Thanks, @jamesqo. I left some comments to be addressed, but overall it looks good.

@stephentoub, thanks for reviewing. I've responded to all of your comments.
The EnumerableHelpers.ToArray logic I introduced in #11208, while better in terms of memory consumption, employs a very imperative code style. I refactored the core logic into a new struct called LargeArrayBuilder, so now the method looks like

As a result of this refactoring, we can use the buffering strategy employed by ToArray in other places where we were previously using EnumerableHelpers.ToArray(this) (such as Select.ToArray or Concat.ToArray) and simultaneously avoid some unnecessary indirection/field stores. This results in substantial speedup.

Perf testing and analysis: test code, data.

The following patterns all see substantial speedups (where e is a lazy enumerable):

- e.Concat.ToArray
- e.Append.ToArray
- e.Append.Append.ToArray
- e.Select.ToArray
- e.OrderBy.Select.ToArray (OrderBy returns an IPartition that sometimes can't get its count cheaply)

Once this is merged I will change #12703 to make Where use it too, so all Where.ToArray calls should be substantially faster.

cc @stephentoub @jkotas @VSadov @JonHanna
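As a rough sketch of the buffering strategy described above (Python with hypothetical names; the real LargeArrayBuilder is a C# struct), full buffers are parked rather than recopied on growth, so each element is written once into a buffer and then copied once more into the final array:

```python
class LargeArrayBuilderSketch:
    """Toy model of a grow-by-doubling builder with chunked buffers."""

    def __init__(self, initial_capacity=4):
        self._buffers = []                         # completed, full buffers
        self._current = [None] * initial_capacity  # buffer being filled
        self._index = 0                            # next free slot in _current

    def add(self, item):
        if self._index == len(self._current):
            self._allocate_buffer()
        self._current[self._index] = item
        self._index += 1

    def add_range(self, items):
        for item in items:
            self.add(item)

    def _allocate_buffer(self):
        # Park the full buffer and start a new one twice as large;
        # nothing already added is recopied here.
        self._buffers.append(self._current)
        self._current = [None] * (len(self._current) * 2)
        self._index = 0

    def to_array(self):
        # Single final pass: parked buffers plus the used part of _current.
        result = []
        for buf in self._buffers:
            result.extend(buf)
        result.extend(self._current[:self._index])
        return result
```

For example, adding ten items to a builder with initial capacity 4 parks one full buffer of 4 and leaves the remaining 6 items in an 8-slot current buffer; to_array then produces the ten items in order.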