This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Add LargeArrayBuilder type to encapsulate ToArray logic #13076

Merged
stephentoub merged 8 commits into dotnet:master from jamesqo:large-arraybuilder
Nov 9, 2016
Conversation

@jamesqo
Contributor

@jamesqo jamesqo commented Oct 27, 2016

The EnumerableHelpers.ToArray logic I introduced in #11208, while better in terms of memory consumption, employs a very imperative code style. I refactored the core logic into a new struct called LargeArrayBuilder so now the method looks like

// Check for ICollection

var builder = new LargeArrayBuilder(initialize: true);
builder.AddRange(source);
return builder.ToArray();

As a result of this refactoring, we can use the buffering strategy employed by ToArray in other places where we were previously using EnumerableHelpers.ToArray(this) (such as Select.ToArray or Concat.ToArray), and simultaneously avoid some unnecessary indirection/field stores. This results in a substantial speedup.
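The buffering strategy behind this can be sketched as follows. This is a minimal Python illustration of the idea (not the actual C# LargeArrayBuilder), assuming a doubling chunk size and a single final copy; all names here are illustrative.

```python
class LargeArrayBuilderSketch:
    """Chunked builder: fill a current buffer; when it is full, retire it
    and start a new buffer twice as large instead of copying. Items are
    copied exactly once, in to_array()."""

    STARTING_CAPACITY = 4  # illustrative, not the real constant

    def __init__(self):
        self._buffers = []  # retired (full) buffers, in order
        self._current = [None] * self.STARTING_CAPACITY
        self._index = 0     # next free slot in _current
        self._count = 0     # total items added

    def add(self, item):
        if self._index == len(self._current):
            self._allocate_buffer()
        self._current[self._index] = item
        self._index += 1
        self._count += 1

    def _allocate_buffer(self):
        # Retire the full buffer; no per-resize copying of old items.
        self._buffers.append(self._current)
        self._current = [None] * (len(self._current) * 2)
        self._index = 0

    def to_array(self):
        # Single final copy into an exactly-sized result.
        result = []
        for buf in self._buffers:
            result.extend(buf)  # retired buffers are always full
        result.extend(self._current[:self._index])
        assert len(result) == self._count
        return result
```

Compared with a grow-and-copy list, each item is moved once at the end rather than on every resize, which is the memory-consumption win the PR description refers to.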

Perf testing and analysis: test code, data.

The following patterns all see substantial speedups (where e is a lazy enumerable):

  • e.Concat.ToArray
  • e.Append.ToArray
  • e.Append.Append.ToArray
  • e.Select.ToArray
  • e.OrderBy.Select.ToArray (OrderBy returns an IPartition that sometimes can't get its count cheaply)

Once this is merged I will change #12703 to make Where use it too, so all Where.ToArray calls should be substantially faster.

cc @stephentoub @jkotas @VSadov @JonHanna

Contributor Author

I found out this wasn't getting inlined, so I tagged it with AggressiveInlining. However, it turns out that doing so generates ~50 more bytes of code for each call site to Add, and the effect is amplified since we're using the builder in a generic context.

So my solution was to add a SlowAdd method wrapper around this that was tagged with NoInlining. Add is used in places where it is part of the bottleneck (such as in loops), otherwise SlowAdd is used to minimize code bloat.

Member

@stephentoub stephentoub Nov 7, 2016

I found out this wasn't getting inlined, so I tagged it with AggressiveInlining

Why was it important that it get inlined? Do you have a benchmark that shows a measurable improvement in key scenarios from that happening? Otherwise I'd rather let the JIT do its thing and not complicate the API here with two Add methods that are identical other than in inlining.
cc: @AndyAyersMS

Contributor Author

@stephentoub, will post when I get a chance.

Contributor Author

Yes, removing the AggressiveInlining indeed does result in a perf regression. I wrote this app to measure the throughput with/without the AggressiveInlining; removing the attribute results in a significant (~30%) slowdown.

Member

removing the attribute results in a significant (~30%) slowdown

You're comparing release builds, and the only difference between them is whether Add has AggressiveInlining on it? I just tried with your PR (with and without that one line with AggressiveInlining), and while I do see an improvement that comes from the PR in general, I don't see any such difference between whether Add has AggressiveInlining on it.

Member

Yes. What test app are you using to benchmark?

Same as you. I've tried a few other things, though, and I am able to construct a simple console app that sees about a ~15% improvement from the AggressiveInlining, so while not 30%, it does seem worthwhile to keep it for now.

Contributor Author

@stephentoub, good. My percentages may have been skewed a bit since I was expressing the difference relative to the smaller time instead of the larger one (1100 is a ~22% regression from 900, while 900 is an ~18% improvement from 1100), so I will have to be careful about that in the future.

Member

Sorry, @jamesqo, I just saw this now. Did you ever drill deeper into the codegen with and without forced inlining?

Contributor Author

@jamesqo jamesqo Nov 24, 2016

@AndyAyersMS Sorry, I forgot about your comment until now. I do have a gist from around the time I made this PR: https://gist.github.com/jamesqo/e46cdf95bd416220c48e16583fd91f6c The method being jitted is AppendPrepend1Iterator.LazyToArray, which you can find in the diffs.

  • NoInline.asm is when AggressiveInlining is not applied and the code looks like
if (!_appending) { builder.Add(_item); }
builder.AddRange(_source);
if (_appending) { builder.Add(_item); }
  • WithInline.asm is that same code sample, except when AggressiveInlining is applied to Add

  • WithInline2.asm is when the code is instead written like

if (_appending) { builder.AddRange(_source); }
builder.Add(_item);
if (!_appending) { builder.AddRange(_source); }

That's all the investigation I did.
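The two branch layouts compared in those diffs are behaviorally equivalent; only the shape of the generated code differs. A Python sketch with plain lists standing in for the C# builder (function names here are illustrative):

```python
def lazy_to_array_v1(item, source, appending):
    """Layout from NoInline.asm: guard the Add calls on both sides."""
    builder = []
    if not appending:
        builder.append(item)   # prepended item goes first
    builder.extend(source)     # builder.AddRange(_source)
    if appending:
        builder.append(item)   # appended item goes last
    return builder

def lazy_to_array_v2(item, source, appending):
    """Layout from WithInline2.asm: one branch, duplicated bodies."""
    builder = []
    if appending:
        builder.extend(source)
        builder.append(item)
    else:
        builder.append(item)
        builder.extend(source)
    return builder
```

Both produce the same array for either value of the flag; the debate later in the thread is purely about which form reads better and which JITs smaller.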

Member

Ok, thanks. Not obvious from just this where the win came from. I'd need to go back and look at what Add compiles to when it's not inlined.

Contributor Author

self-note: Remove this variable

@karelz
Member

karelz commented Oct 29, 2016

cc: @Priya91 @ianhays

@jamesqo
Contributor Author

jamesqo commented Oct 30, 2016

@karelz, FYI I'd say this PR is more related to System.Linq than System.Collections. Would be nice if you could relabel / change assignees.

@karelz
Member

karelz commented Oct 31, 2016

I see the code changes in both -- Collections and Linq. I will leave area owners/experts to decide who is in best position to do the code review.

Contributor Author

TODO: It may be worth eliminating this and just writing the whole thing inline in AddRange(IEnumerable). The one place where it's used can be eliminated by adding some extra methods to SingleLinkedNode.


@jamesqo jamesqo force-pushed the large-arraybuilder branch from 355474b to 6c76751 on November 2, 2016 at 23:30
@jamesqo
Contributor Author

jamesqo commented Nov 2, 2016

@stephentoub Do you think it would be possible for you to review this when you have time? Thanks.

@karelz
Member

karelz commented Nov 2, 2016

@ianhays @Priya91 @VSadov can you please review the code?

@karelz
Member

karelz commented Nov 6, 2016

@ianhays @Priya91 @VSadov @OmarTawfik ping? Who's the best person to review the change?

Member

@stephentoub stephentoub Nov 7, 2016

Why did you rename this from GenerateEnumerable to GenerateSequence? Seems like it was fine named what it was, maybe even better as Enumerable.

Contributor Author

@stephentoub, can rename back.

Member

Would this not be better / simpler as:

if (_appending)
{
    builder.AddRange(_source);
    builder.SlowAdd(_item);
}
else
{
    builder.SlowAdd(_item);
    builder.AddRange(_source);
}

?

Contributor Author

@stephentoub, this is the same pattern that is used in other places throughout the file, e.g. here.

Member

Ok. For consistency it's worth keeping it like this for now, but I think it'd also be worth following up to see whether it'd be worth changing all of them to what, at least to me, looks like a simpler and more understandable pattern. I don't know what the JIT'd code looks like, but it's possible there's also a perf benefit to one or the other, either in number of actual branches or in IL size or something like that.

Contributor Author

@jamesqo jamesqo Nov 8, 2016

@stephentoub

I don't know what the JIT'd code looks like, but it's possible there's also a perf benefit to one or the other

Note that _appending is readonly, so the JIT should be able to cache the field access in a register there and save some code size. I haven't checked, though.

Member

Why not keep the cast as it was?

return ((IEnumerable<TSource>)ToArray()).GetEnumerator();

Contributor Author

I personally like AsEnumerable better, but sure.

Member

Why separate this out into its own (public) method? Where else is it being used?

Member

@stephentoub stephentoub Nov 7, 2016

Why are you creating this intermediate array? Don't we know how long _appended is from the count stored in each node? Can't we just iterate through the list, adding each node's item to the array from the last to the first, as is done for _appended elsewhere?

Contributor Author

@jamesqo jamesqo Nov 7, 2016

@stephentoub We cannot presize the array here since _source is lazy. Since we can't presize the array, we don't know if there will be enough room to fit all of the appended items; we might have to do another allocation in the builder.

Member

Was such an array getting allocated before? I'm struggling to see this as a pure win if we're introducing additional intermediate ToArrays like this. If it was already there before, then ok, though it would be nice subsequently to find a way to avoid it.

Contributor Author

@stephentoub Yes. The logic in this method is basically a copy verbatim of what MoveNext currently does.

Member

Yes. The logic in this method is basically a copy verbatim of what MoveNext currently does.

Ugh. Ok. Thanks.

Member

@stephentoub stephentoub Nov 7, 2016

Why separate this out? The other cases in ToArray aren't in their own methods, seems like this should just be inline with the others, which then also removes the need for the assert.

Contributor Author

@jamesqo jamesqo Nov 7, 2016

@stephentoub, ToArray is generic so it can be regenerated by the JIT. The logic in LazyToArray seemed big enough to be worth separating out into a new method, to avoid penalizing code that doesn't go down this branch. I'll post the generated code size of this method when I get time.

Contributor Author

Just checked. LazyToArray comes in at an extra 306 bytes of code: https://gist.github.com/jamesqo/6b5f2cc27ca36fd3f488c7be57a87cb4 It would probably be wise to avoid generating all that unnecessary code if it's never called.

Member

It would probably be wise to avoid generating all that unnecessary code if it's never called.

Why is that the case for "case -1" but not the case for "default" in ToArray? My point is that we should be consistent here.

Contributor Author

@stephentoub, I separated the default: case into a new PreallocatingToArray method.

Member

What does the comment "Is _first <= ResizeLimit." mean?

Contributor Author

I meant to say, "If _count <= ResizeLimit then _first == _current." I'll rephrase that.

Member

Instead of the : this(), how about just explicitly initializing the remaining fields (_buffers, _index, _count)?

Contributor Author

@stephentoub, is explicit initialization faster than using the default struct constructor?

Contributor Author

I just checked. If we initialize members explicitly then the constructor doesn't get inlined:

G_M36021_IG02:
       488D4DC8             lea      rcx, bword ptr [rbp-38H]
       BA01000000           mov      edx, 1
       E88CFAFFFF           call     LargeArrayBuilder`1:.ctor(bool):this

While this isn't the bottleneck, I think using : this() looks more readable anyway.

Member

I think using : this() looks more readable anyway.

I personally disagree. To me this looks like we're initializing some of the fields multiple times: delegating to "this()" needs to ensure that all of the fields are initialized, and then we subsequently overwrite some of them. (It looks to me the same as having a bunch of fields in a class that are explicitly initialized to default(T) at the field declaration site and then subsequently overwritten in an explicit ctor.) We also don't use this pattern elsewhere in corefx, which makes it stand out to me like a sore thumb.

If there's actually an inlining difference and it matters, that's something that should be addressed in the JIT.
cc: @AndyAyersMS

Contributor Author

@stephentoub

it looks to me the same as having a bunch of fields in a class that are explicitly initialized to default(T) at the field declaration site and then subsequently overwritten in an explicit ctor

If you think about it, that can be faster since the zeroing can be done in bulk vs. initializing fields 1 at a time. For example, on my machine the jit generates code to use ymm registers:

       C4E17957C0           vxorpd   ymm0, ymm0
       C4E17A7F02           vmovdqu  qword ptr [rdx], ymm0
       C4E17A7F4210         vmovdqu  qword ptr [rdx+16], ymm0

If there's actually an inlining difference and it matters, that's something that should be addressed in the JIT.

Considering things from the viewpoint of the JIT, isn't the constructor just a bunch of arbitrary code? So then when it looks at the constructor and sees that it has a huge body, it might not decide to inline even though all of the code is just default-initializing the struct. It would seem that if the JIT were to try to inline this, it would waste time for other cases where that arbitrary code isn't all just default-initialization.

I'm not really familiar with how the JIT works, but just a guess... cc @mikedn, @benaadams

Member

If you think about it, that can be faster since the zeroing can be done in bulk vs. initializing fields 1 at a time

I'm not talking about what actually happens in the implementation; you were talking about readability, and part of readability is being able to reason about what's happening, so I was sharing what immediately comes to my mind when I see that code, regardless of what the compiler does, and hence why for me it harms readability, along with the fact that readability is helped by seeing common patterns and harmed by seeing jarring patterns, and to me, this use of ": this()" in a struct is jarring.

A compiler can choose to optimize certain patterns, so the compiler could choose to generate the exact same code for:

public SomeStruct()
{
    _field1 = 0;
    _field2 = null;
    _field3 = Calculate();
}

as it does for:

public SomeStruct() : this()
{
    _field3 = Calculate();
}

Contributor Author

@jamesqo jamesqo Nov 8, 2016

along with the fact that readability is helped by seeing common patterns and harmed by seeing jarring patterns, and to me, this use of ": this()" in a struct is jarring.

OK, I can remove the explicit : this() call then. 👍

A compiler can choose to optimize certain patterns, so the compiler could choose to generate the exact same code for:

Yes. I was trying to say that with the first version the JIT would (probably) have to inspect more bytecode to see that 0 was being assigned to _field1, and null was being assigned to _field2, before making that optimization (which would presumably come at a cost since all of this is done at runtime).

Member

Why have AllocateBuffer return _current? We don't always use the result... why not just have the call sites that need _current access _current?

Contributor Author

Ok

Member

Can this overflow?

Contributor Author

Not before we get an OOME.

Member

Not before we get an OOME.

Won't it overflow when destination.Length == 0x40000000?

Contributor Author

@stephentoub, good point. In that case, however, AllocateBuffer will attempt to allocate an array of size 0x40000000 * 2, which will raise an OverflowException anyway. Also, if we've reached that point then we've added 0x80000000 items, meaning ToArray couldn't possibly work. I don't know if it's worth adding an extra checked block which has no effect and protects against a case which is already broken.

Member

however, AllocateBuffer will attempt to allocate an array of size 0x40000000 * 2, which will raise an OverflowException anyway

Will it? I've not tried it, but it looks like it would actually see _count as negative, which is less than ResizeLimit, so it would enter the "we're still less than ResizeLimit" block. Maybe those checks need to be made robust.

Contributor Author

Ah, very nice catch. It looks like we will multiply _count by 2 again, and 0x80000000 * 2 is 0, so then we allocate a 0-length array and set it to _first / _current instead of raising an exception. Since it doesn't add any additional cost, I can change the check to be if ((uint)_count < (uint)ResizeLimit) which should get us down the path which throws for overflow.
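The wraparound being discussed can be demonstrated by simulating C#'s 32-bit signed arithmetic in Python. This is a sketch only; RESIZE_LIMIT here is an illustrative stand-in for the builder's real constant.

```python
def to_int32(x):
    """Wrap an integer to C#-style signed 32-bit arithmetic."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

RESIZE_LIMIT = 8  # illustrative value, not the real constant

# Doubling a 0x40000000-element count wraps negative in 32-bit math...
count = to_int32(0x40000000 * 2)
print(count)  # -2147483648

# ...so a signed compare wrongly takes the "count is still small" path:
signed_small = count < RESIZE_LIMIT                   # True
# An unsigned compare, as in `(uint)_count < (uint)ResizeLimit`, does not:
unsigned_small = (count & 0xFFFFFFFF) < RESIZE_LIMIT  # False
```

Reinterpreting both sides as unsigned makes any wrapped-negative count compare as a huge value, pushing execution down the allocation path that throws on overflow instead of silently allocating a zero-length buffer.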

@stephentoub
Member

Thanks, @jamesqo. I left some comments to be addressed, but overall it looks good.

@jamesqo
Contributor Author

jamesqo commented Nov 8, 2016

@stephentoub, thanks for reviewing. I've responded to all of your comments.

@jamesqo
Contributor Author

jamesqo commented Nov 8, 2016

Test Innerloop OSX Debug Build and Test

@jamesqo
Contributor Author

jamesqo commented Nov 8, 2016

Test Innerloop Linux ARM Emulator SoftFP Debug Cross Build
Test Innerloop Linux ARM Emulator SoftFP Release Cross Build

Member

@stephentoub stephentoub left a comment

Thanks, @jamesqo.

@stephentoub stephentoub merged commit 8ee5cfd into dotnet:master Nov 9, 2016
@jamesqo jamesqo deleted the large-arraybuilder branch November 9, 2016 11:22
@karelz karelz modified the milestone: 1.2.0 Dec 3, 2016