Optimize Enumerable.Skip() for IList<> parameter by scalablecory · Pull Request #4551 · dotnet/corefx

scalablecory · 2015-11-17T20:56:36Z

Changes Enumerable.Skip() from O(n) to O(1) when an IList<> is passed.

stephentoub · 2015-11-17T21:56:55Z

It'd be good to cache source.Count into a "local"; otherwise this will incur an interface call on each iteration. The extra field for that local on the display class is likely a good tradeoff.

Since there was contention about saving the Count value to a local (yes: save on interface invocation, no: what if the list was mutated) I think we should have a test that fires one of these up, iterates a bit, removes some data from the original IList value, and continues iterating. That way the most-recently agreed-upon behavior is codified in a test that goes in along with the change.

stephentoub · 2015-11-17T22:06:07Z

One comment, but otherwise the src changes LGTM.

We likely need some more tests. The code coverage report shows the new code being covered, but from looking at the existing test cases I think that might be misleading. We should ensure that we're examining all of the same inputs / cases for both the enumerable and list inputs.

cc: @VSadov, @JonHanna

JonHanna · 2015-11-17T22:59:09Z

It might be worth measuring the times on arrays, as iterating through them via IList is relatively expensive.

Potentially, this could work well in combination with the IPartition interface of #2401 which would allow for combinations of skips and takes to fare well, though there's no reason why this should wait on that.

JonHanna · 2015-11-17T23:16:10Z

There would be a change in behaviour when the source is modified during enumeration of the items. I'm not sure it's an issue, as it would make code that currently throws work, rather than the other way around (and I've personally never agreed with enumerators explicitly banning such changes anyway), but it's worth noting that it's not an unobservable change in that regard at least.

JonHanna · 2015-11-18T00:20:25Z

I know this contradicts what @stephentoub said in recommending this a commit ago, but all of the cases where calls to Count are currently pushed to a local are true locals and used to produce a result immediately rather than captured and used in an iterator. This means the only way it can race against something on the same thread is something relatively obscure happening in a Func. In this case there could be something affecting Count within a foreach on the results of the skip. It might be more conservative to keep calling Count.

Yes, agree on this. Side-effecting funcs are discouraged, but cannot be prevented.
When replacing iteration with indexing, we need to re-check Count to preserve existing behavior or we could potentially break some previously-working code.

Sounds like we're leaning toward not caching the count. Someone give a definitive word on this and I'll revert the commit. @stephentoub, any objection?

@stephentoub, any objection?

No objection, thanks. My suggestion was based on performance, but that loses out to correctness. The only way it would be incorrect is in a situation where there's already invalid usage, but I'm fine with the argument that we want to be as correct as possible even in that situation, given some definition of "correct" (there will still be oddities, and things that may be considered incorrect).
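A minimal sketch of the trade-off settled in this thread, assuming a hypothetical helper named SkipList (this is not the code from the PR). The loop re-reads source.Count on every pass instead of caching it, so a list that shrinks mid-enumeration ends the loop rather than indexing out of range.

```csharp
using System;
using System.Collections.Generic;

public static class SkipSketch
{
    // Hypothetical helper, not the corefx source: skips `count` items by index.
    public static IEnumerable<T> SkipList<T>(IList<T> source, int count)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return Iterate(source, Math.Max(count, 0));
    }

    private static IEnumerable<T> Iterate<T>(IList<T> source, int start)
    {
        // source.Count is an interface call on every pass. That is the cost
        // being weighed against staying in range when the list shrinks.
        for (int i = start; i < source.Count; i++)
            yield return source[i];
    }

    public static void Main()
    {
        var list = new List<int> { 1, 2, 3, 4 };
        foreach (int item in SkipList(list, 1))
        {
            if (item == 2) list.RemoveAt(3); // shrink the list while iterating
            Console.WriteLine(item);         // prints 2, then 3
        }
    }
}
```

Had Count been cached as 4 before the loop, the final pass would index position 3 of a three-item list and throw ArgumentOutOfRangeException.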

JonHanna · 2015-11-18T00:31:08Z

Two concerns, but I like it in principle. I think as well as tests it would be good to have some performance information. I suspect there's a trade-off here with most source IList implementations (especially arrays, and most of all arrays being used with array variance, but the others as well) so that there's a threshold for count as a percentage of the number of items actually enumerated below which this is a nett loss. Proving that suspicion wrong would of course be fantastic, but otherwise having an idea of just what sort of numbers gain and which lose would be good.

ikopylov · 2015-11-18T16:48:01Z

This PR changes the behavior of the following code:

var list = new List<int>() { 1, 2, 3 };
foreach (var item in list.Skip(1))
    list.Add(item);

The original implementation throws InvalidOperationException due to the modification of the collection. The new implementation does not (or it even throws ArgumentOutOfRangeException when Remove is used).
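The difference can be reproduced outside LINQ with two stand-in helpers. Both SkipByEnumerator and SkipByIndex are hypothetical names written for this illustration, approximating the old and new paths rather than copying them. Remove is used instead of Add because an index-based loop that re-reads Count would never terminate if every iteration appended an item.

```csharp
using System;
using System.Collections.Generic;

public static class MutationDemo
{
    // Approximates the old path: walks the source's own enumerator.
    public static IEnumerable<T> SkipByEnumerator<T>(IEnumerable<T> source, int count)
    {
        using (IEnumerator<T> e = source.GetEnumerator())
        {
            while (count > 0 && e.MoveNext()) count--;
            if (count <= 0)
                while (e.MoveNext()) yield return e.Current;
        }
    }

    // Approximates the new path: indexes into the list, re-reading Count.
    public static IEnumerable<T> SkipByIndex<T>(IList<T> source, int count)
    {
        for (int i = Math.Max(count, 0); i < source.Count; i++)
            yield return source[i];
    }

    // Returns the name of the exception thrown, or "none".
    public static string RemoveWhileIterating(bool useIndex)
    {
        var list = new List<int> { 1, 2, 3 };
        IEnumerable<int> skipped = useIndex
            ? SkipByIndex(list, 1)
            : SkipByEnumerator(list, 1);
        try
        {
            foreach (int item in skipped)
                list.Remove(item);
            return "none";
        }
        catch (Exception ex)
        {
            return ex.GetType().Name;
        }
    }

    public static void Main()
    {
        Console.WriteLine(RemoveWhileIterating(useIndex: false)); // InvalidOperationException
        Console.WriteLine(RemoveWhileIterating(useIndex: true));  // none
    }
}
```

In the index-based run the item 3 is never visited, because removing 2 shifts it into a position the loop has already passed.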

stephentoub · 2015-11-18T16:51:10Z

@ikopylov (and @JonHanna, who also mentioned it in a comment earlier), it does, but we've also been adding such special cases elsewhere in LINQ, in fact @JonHanna I believe you've added some of them 😉. If we're concerned about it in this case, we've got a lot of changes to go back and revert. I believe we agreed this was an acceptable difference. @VSadov? @weshaggard?

ikopylov · 2015-11-18T17:00:54Z

I believe we agreed this was an acceptable difference.

That's great.

However source.Count should not be cached to avoid ArgumentOutOfRangeException.

stephentoub · 2015-11-18T17:06:20Z

However source.Count should not be cached to avoid ArgumentOutOfRangeException.

I'm ok with that, but to be clear, changing the collection is going to potentially result in very strange behavior, even without that. For example, if an item is added to the beginning of the list between two iterations, this change will likely result in the same item being returned twice.
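The duplicate-item case described above can be shown with a small sketch. SkipByIndex is a hypothetical stand-in for an index-based Skip that re-reads Count, not the PR's code.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateDemo
{
    // Hypothetical index-based skip that re-reads Count on every pass.
    public static IEnumerable<T> SkipByIndex<T>(IList<T> source, int count)
    {
        for (int i = Math.Max(count, 0); i < source.Count; i++)
            yield return source[i];
    }

    // Inserts one item at the front of the list after the first item is seen.
    public static List<int> InsertAtFrontWhileIterating()
    {
        var list = new List<int> { 1, 2, 3 };
        var seen = new List<int>();
        bool inserted = false;
        foreach (int item in SkipByIndex(list, 1))
        {
            seen.Add(item);
            if (!inserted)
            {
                list.Insert(0, 0); // everything shifts one position to the right
                inserted = true;
            }
        }
        return seen;
    }

    public static void Main()
    {
        // prints 2, 2, 3
        Console.WriteLine(string.Join(", ", InsertAtFrontWhileIterating()));
    }
}
```

No exception is thrown, yet the value 2 is returned twice: after the insert it sits at index 2, which is the next position the loop reads.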

JonHanna · 2015-11-18T17:26:14Z

I'm pretty sure I've avoided that particular type of change (I might be wrong), but in any case I think this observable change is worth:

1. noting
2. considering carefully
3. and then ultimately deciding as okay.

Step 2 of course entails I'm open to persuasion on step 3 :) Still, I never did agree with that InvalidOperationException in the first place, so I'm not against having it bypassed.

I do agree that Count shouldn't be cached. As I said above, the only places we currently do this can only (in terms of a single thread, anyway) have it altered by strange things in Funcs hit by the query. Here it would still take some relatively strange code to hit it, but not as strange (especially if one follows the idea that Linq Funcs should be pure functions).

Most of all I wonder about the performance. The constant cost of the test I think is negligible (now there's something I certainly have added several times myself), but the impact of the call to IList's indexer could undo the benefits.

stephentoub · 2015-11-18T17:29:01Z

I'm pretty sure I've avoided that particular type of change

Hmm, ok, we'll need to go back and look. I thought we'd already discussed such a change in the context of other ones being made previously. If that's not the case, then we need to put the brakes on this change and address that issue before moving forward with this.

VSadov · 2015-11-18T17:48:49Z

About throwing IOE when collection changes while iterating. IMO this is a fairly useless behavior and I have not seen a single case of code depending on this. In particular, the behavior composes very unreliably. Example: trivial concatenation of two lists via Linq will result in something that will guard against modifications, but only when you are enumerating the matching half ...
Honestly, I think if these kinds of checks were removed everywhere, nobody would notice.
I think this part of the change is ok.

JonHanna · 2015-11-18T17:55:02Z

Honestly, I think if these kinds of checks were removed everywhere, nobody would notice.

I would. I'd open a particularly nice bottle and toast whoever made that change that I've been wanting to see for over a decade ;)

svick · 2015-11-18T17:59:16Z

What about ImmutableList<T>? It implements IList<T>, so this change would apply to it, but its indexer seems to be O(log n) (unfortunately this is not documented).

This means that iterating immutableList.Skip(k) is O(n) with the old implementation, but O((n-k) * log n) with the proposed change. So, it's likely that this change would actually decrease performance for ImmutableList<T>, especially for large n and small k. (Though I haven't actually performed any benchmarks.)

svick · 2015-11-18T18:08:18Z

@VSadov

About throwing IOE when collection changes while iterating. IMO this is a fairly useless behavior and I have not seen a single case of code depending on this.

I think that's because it's a boneheaded exception: no production code will rely on it, but it is still useful, because it tells developers when they made an error.

Maybe it would make sense to find some other, more reliable, way to tell developers about that error, like a Roslyn analyzer?

stephentoub · 2015-11-18T18:09:16Z

We could special case T[] rather than IList<T> and avoid both of these issues.

Separately, though, it seems unfortunate if ImmutableList<T>'s indexer is O(log n). If that is the case, it is unexpected, and certainly not a pit of success, as, separate from this change, common code for walking each item in the list via for will end up being O(n log n) rather than O(n). @AArnott, thoughts?

scalablecory · 2015-11-18T18:40:05Z

@JonHanna Benchmark results show this is a clear winner for arrays and lists regardless of skip length.

Other optimizations that could be made: ToArray/ToList speedup (not sure how often Skip is piped to those, I can't say I've ever done it, so I left it out), Skip(<=0) simply returning the passed in collection unchanged (again can't say I've ever done this).

AArnott · 2015-11-19T03:21:37Z

@stephentoub said:

it seems unfortunate if ImmutableList 's indexer is O(log n)

That is in fact the case. ImmutableList is internally a binary tree, so indexing into it is O(log n). So indexing every item is n log n, whereas a straight up enumeration of it using its enumerator is simply n.
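To make the two access patterns concrete, here is a rough sketch that walks the tail of an ImmutableList<int> both ways. The class and method names are made up for illustration, and the code assumes System.Collections.Immutable is referenced. Both methods return the same sum; only the cost differs, since each indexer call descends the tree while the enumerator walks it once.

```csharp
using System;
using System.Collections.Immutable;
using System.Linq;

public static class ImmutableWalk
{
    // Index-based walk: every list[i] is a separate O(log n) tree lookup.
    public static long SumByIndex(ImmutableList<int> list, int skip)
    {
        long sum = 0;
        for (int i = Math.Max(skip, 0); i < list.Count; i++)
            sum += list[i];
        return sum;
    }

    // Enumerator-based walk: one pass over the tree, discarding the first `skip` items.
    public static long SumByEnumerator(ImmutableList<int> list, int skip)
    {
        long sum = 0;
        int index = 0;
        foreach (int item in list)
        {
            if (index++ >= skip) sum += item;
        }
        return sum;
    }

    public static void Main()
    {
        ImmutableList<int> list = ImmutableList.CreateRange(Enumerable.Range(0, 1000));
        Console.WriteLine(SumByIndex(list, 10) == SumByEnumerator(list, 10)); // True
    }
}
```

Timing the two methods for large n and small skip values would be one way to check the suspicion raised above that the indexed path is slower for this type.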

JonHanna · 2015-11-19T11:01:29Z

@scalablecory

Benchmark results show this is a clear winner for arrays and lists regardless of skip length.

I also got reassuring results trying the following, which isn't extensive but does hit a case deliberately engineered to be an expected worst case:

[Fact]
public void QuickSpeedTest()
{
    IEnumerable<object> source = new string[100000];
    var sw = Stopwatch.StartNew();
    for(int i = 0; i != 10000; ++i)
        foreach(var item in source.Skip(1))
        {
        }
    sw.Stop();
    Assert.Equal(0, sw.ElapsedMilliseconds); // Just want this output
}

It came in at a very slight gain even with the unsafe memoising of Count undone.

I'm satisfied about the case I was most worried about.

ImmutableList I think is less of a worry. It could be easily enough given its own optimisation:

ImmutableList<TSource> immute = source as ImmutableList<TSource>;
if (immute != null)
{
    int newLen = immute.Count - count;
    if (newLen <= 0) return Enumerable.Empty<TSource>();
    return immute.GetRange(count, newLen);
}

But apart from this needing to create a dependency on System.Collections.Immutable, I think the case where ImmutableList is being used and the code has to be agnostic to the fact that it is being used (rather than just use GetRange instead of Linq because the coder knows they're using immutable structures) is likely relatively obscure. A lot of things have very different performance behaviour with immutable collections, which is one of the reasons to use them (since that difference is often for the better) and they're used with those differences in mind. There might be value in an extension within Collections.Immutable which did Skip() typed to ImmutableList<T> and/or one that provided a wrapper enumerable for passing to code (not just limited to Linq) where those performance differences would hurt more than help.

Other optimizations that could be made: ToArray/ToList speedup (not sure how often Skip is piped to those, I can't say I've ever done it, so I left it out).

I've certainly piped Skip(pageNum * PageSize).Take(PageSize) to those, but not Skip on its own. If you wanted to go that far, then I'd suggest optimising Take on the result too, and hence covering that case as well. (Again #2401 has an interface for a type that can be optimised for both Skip and Take, so if that's accepted it could be a good basis for it).

…Skip(<=0) simply returning the passed in collection unchanged (again can't say I've ever done this).

I'd say it's relatively common, when the value passed to Skip() is a variable that could be 0. That said, while I would really like to do this (along with a few other cases where we could avoid allocation by returning an existing object), it would result in a strong change in the observable behaviour. Indeed, I remember @VSadov saying he's seen people call .Skip(0) purely as a means to quickly get an object which enumerates the same items but is not the same collection object.
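The observable behaviour in question can be checked with a short sketch. The class name is made up, and the results noted in the comments reflect Skip(0) on a List<int> wrapping the source rather than returning it, which is the behaviour this comment argues callers may depend on.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkipZeroDemo
{
    // True when Skip(0) hands back a different object from the source list.
    public static bool ReturnsDistinctObject(List<int> list)
    {
        IEnumerable<int> view = list.Skip(0);
        return !ReferenceEquals(list, view);
    }

    public static void Main()
    {
        var list = new List<int> { 1, 2, 3 };
        IEnumerable<int> view = list.Skip(0);

        Console.WriteLine(ReturnsDistinctObject(list)); // True
        Console.WriteLine(view is List<int>);           // False: callers cannot cast back and mutate
        Console.WriteLine(view.SequenceEqual(list));    // True: same items in the same order
    }
}
```

Returning the source unchanged for Skip(0) would flip the first two results, which is why it counts as a visible change rather than a pure optimisation.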

In all, this one commit ago (before the caching of Count) LGTM.

JonHanna · 2015-11-19T12:34:37Z

We likely need some more tests. The code coverage report shows the new code being covered, but from looking at the existing test cases I think that might be misleading.

Yes, the current tests are a mixture of some I created, which mostly use Enumerable.Range() and those imported from the legacy tests that mostly use arrays. They are checking on slightly different things. They each need to be changed to have both a test with guaranteed non-list (Enumerable.Range() doesn't absolutely guarantee that, and treating it as a list is a possible future optimisation, but NumberRangeGuaranteedNotCollectionType() does guarantee that) and guaranteed IList.


scalablecory · 2015-11-23T17:46:13Z

Count memoization reverted, should be good to go.

JonHanna · 2015-11-23T19:05:06Z

All the tests that hit the old iterator should ideally be doubled up so that there is a version for both guaranteed list source and guaranteed non-list source.
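One way to double up a test is sketched below. ForceNotList is a hypothetical helper written for this illustration (the comment above names NumberRangeGuaranteedNotCollectionType() as the existing test helper; this is not that method). It relies on a C# iterator block producing an object that does not implement IList<T>, so Skip cannot take the list path.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SourceShapes
{
    // Hypothetical helper: the compiler-generated iterator never implements IList<T>.
    public static IEnumerable<T> ForceNotList<T>(IEnumerable<T> source)
    {
        foreach (T item in source)
            yield return item;
    }

    // Runs the same Skip over a guaranteed list source and a guaranteed non-list source.
    public static bool SkipAgrees(int[] data, int count)
    {
        IEnumerable<int> viaList = new List<int>(data).Skip(count);
        IEnumerable<int> viaSequence = ForceNotList(data).Skip(count);
        return viaList.SequenceEqual(viaSequence);
    }

    public static void Main()
    {
        int[] data = { 1, 2, 3, 4, 5 };
        foreach (int count in new[] { -1, 0, 2, 5, 9 })
            Console.WriteLine(SkipAgrees(data, count)); // True for each count
    }
}
```

The counts cover the negative, zero, mid-range, exact-length and past-the-end cases, so both paths are compared on the same inputs.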

VSadov · 2015-12-16T23:05:30Z

LGTM

Optimize Enumerable.Skip() for IList<> parameter

dotnet#4551 introduced optimised versions of Skip for IList<T> sources. Have all tests for Skip test both this and the previous path.

Optimisation of Skip() for IList sources from dotnet#4551 fits with optimisations of Skip() and Take() for other sources from dotnet#2401. Combine the approaches, extending how the result of Skip() on a list handles subsequent operations.

Anything that can serve as one can serve as the other, and also provide a faster path for Count(). Merge the two interfaces and add a Count property. Have IList optimised result of Skip() partitionable.

Commit migrated from dotnet/corefx@a087c2d

Optimize Enumerable.Skip for IList.

5a72312

dnfclas added the cla-already-signed label Nov 17, 2015

joshfree removed the cla-already-signed label Nov 17, 2015

stephentoub reviewed Nov 17, 2015

Avoid repeated interface call as suggested by @stephentoub.

c9953f4

JonHanna reviewed Nov 18, 2015

Revert "Avoid repeated interface call as suggested by @stephentoub."

7d775ac

This reverts commit c9953f4.

stephentoub assigned VSadov Dec 2, 2015

VSadov added a commit that referenced this pull request Jan 4, 2016

Merge pull request #4551 from scalablecory/linq-skip-optimization

e9f2377

Optimize Enumerable.Skip() for IList<> parameter

VSadov merged commit e9f2377 into dotnet:master Jan 4, 2016

JonHanna mentioned this pull request Jan 10, 2016

Have IList and non-IList versions of all Enumerable.Skip tests. #5280

Merged

ikopylov mentioned this pull request Jan 11, 2016

System.Linq: Genericize ToDictionary optimizations for IList<T> #5261

Closed

JonHanna mentioned this pull request Jan 16, 2016

Combine Linq optimisations [WIP] #5486

Closed

JonHanna mentioned this pull request Jan 29, 2016

Combine linq list optimisations #5777

Merged

stephentoub added the netfx-port-consider label Apr 13, 2016

karelz modified the milestone: 1.0.0-rtm Dec 3, 2016

11 participants