Recognize Skip/Take chains on lazy sequences. #13628
stephentoub merged 3 commits into dotnet:master
Conversation
Note: This can overflow if _maxIndexInclusive is int.MaxValue, and _minIndexInclusive is 0 (which happens for e.Skip(0)). It doesn't matter though, since we decrement the value right away.
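For readers following along, a small Python sketch of the wraparound being described (the wrap helper and variable names are mine, not from the PR; C# `int` arithmetic is simulated with a two's-complement reduction):

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def to_int32(n):
    """Reduce an integer into two's-complement 32-bit range, as unchecked C# int arithmetic would."""
    return (n + 2**31) % 2**32 - 2**31

# e.Skip(0).Take(int.MaxValue): _minIndexInclusive = 0, _maxIndexInclusive = int.MaxValue
min_incl, max_incl = 0, INT32_MAX

# The count "max - min + 1" overflows to int.MinValue in 32-bit arithmetic...
count = to_int32(max_incl - min_incl + 1)
assert count == INT32_MIN

# ...but decrementing right away wraps it back to int.MaxValue, which is
# why the overflow is harmless in this code path.
assert to_int32(count - 1) == INT32_MAX
```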
Please add a comment about that.
I would have avoided optimizing the TryGet* functions if I could have, since they take a lot of code and are not as valuable to optimize as the other functions. But currently, they come with implementing IPartition.
Doesn't using _state like this cause a problem for really long enumerables with > int.MaxValue - 3 elements?
Will move to long.
Please make sure tests are added to cover all of these various branches, here and elsewhere.
Can this whole method just be:
private bool SkipBeforeFirst(IEnumerator<TSource> en) => TakeBefore(_minIndexInclusive, en);?
@stephentoub Good idea. In fact, I renamed the method to be SkipBefore and had TryGetElementAt do SkipBefore(_minIndexInclusive + index, en). Should be slightly more efficient.
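To make that refactor concrete, here is a hypothetical Python sketch of how such a helper could work; the contract and names are my guesses, not the PR's actual C# code, which advances an IEnumerator&lt;T&gt;:

```python
_SENTINEL = object()

def skip_before(index, it):
    """Consume `index` elements from `it`; return False if the sequence ran out first.
    (A guess at the helper's contract.)"""
    for _ in range(index):
        if next(it, _SENTINEL) is _SENTINEL:
            return False
    return True

def try_get_element_at(source, min_index_inclusive, index):
    """Sketch of TryGetElementAt: one combined skip of `min + index` elements,
    instead of skipping to the partition start and then skipping again."""
    it = iter(source)
    if skip_before(min_index_inclusive + index, it):
        value = next(it, _SENTINEL)
        if value is not _SENTINEL:
            return True, value
    return False, None

# e.g. a Skip(2) partition: element at partition index 1 is underlying index 3
found, value = try_get_element_at(range(10), 2, 1)
assert (found, value) == (True, 3)
```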
Doesn't this end up limiting enumerables to those with int.MaxValue elements? That limitation was not there before.
I can change this to use long instead. It's unlikely that an enumerable that comes close to long.MaxValue elements can be iterated in a reasonable amount of time.
Marking as no merge temporarily. Since long fields are now being used, which take up an extra 16 bytes, I'll have to re-measure the memory impact this change has. (Also, there's some more refactoring I'd like to do.)
@stephentoub, I managed to refactor the iterator to support enumerables > int.MaxValue while still using int fields. @JonHanna, I would appreciate it if you could review as well. I ran into a couple of subtle integer overflow bugs whilst working on this PR, so the more eyes the better.
We're dishing out an extra unnecessary write/branch per iteration for the loop if HasLimit is false (e.g. if only Skip has been called, and not Take). It could be specialized like this:
if (!HasLimit)
{
    do { builder.Add(en.Current); } while (en.MoveNext());
}
else
{
    do { remaining--; builder.Add(en.Current); } while (remaining >= 0 && en.MoveNext());
}

I figured it wasn't worth the extra code though.
Had to add a uint overload of this function to support scenarios where _maxIndexInclusive is exactly int.MaxValue and we call Count(), for example e.Skip(1).Take(int.MaxValue).Count(). We don't lose any validation at call sites other than GetCount, though, since the signed wrapper asserts.
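A Python sketch of why the signed math breaks down at exactly that boundary (the wrap helpers are mine; C# `int`/`uint` arithmetic is simulated with modular reductions):

```python
INT32_MAX = 2**31 - 1

def to_int32(n):
    # Reduce to two's-complement 32-bit, like unchecked C# int arithmetic.
    return (n + 2**31) % 2**32 - 2**31

def to_uint32(n):
    # Reduce modulo 2^32, like C# uint arithmetic.
    return n % 2**32

# With _maxIndexInclusive == int.MaxValue, an exclusive upper bound of
# max + 1 wraps negative as a signed int...
assert to_int32(INT32_MAX + 1) == -2**31

# ...but fits fine as a uint, which is presumably why an unsigned
# overload was needed for the Count() path.
assert to_uint32(INT32_MAX + 1) == 2**31
```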
Interesting choice for representing the bounds.
Why not uint _start and uint _length? I think the math might get simpler.
You can still use uint.MaxValue as an "unlimited" marker.
@VSadov I originally had something like that, actually, using ints. But since all of the other partitions used _minIndexInclusive and _maxIndexInclusive I decided to follow their lead.
Why not uint _start and uint _length? I think the math might get simpler.
I used ints for the fields, since it's more idiomatic. If we know the length (Take has been called at least once), then the count will fit in an int anyway. Also, it'd still be possible for _start to overflow during a chained Skip/Take, and we'd have to wrap ourselves in another iterator.
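For what it's worth, the two representations discussed here carry the same information; a quick Python sketch of the mapping (the conversion functions and names are mine):

```python
def to_start_length(min_incl, max_incl):
    # [min, max] inclusive index window -> (start, length)
    return min_incl, max_incl - min_incl + 1

def to_min_max(start, length):
    # (start, length) -> [min, max] inclusive index window
    return start, start + length - 1

# A window covering indexes 3..6, expressed either way:
assert to_start_length(3, 6) == (3, 4)
assert to_min_max(3, 4) == (3, 6)
```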
@stephentoub I updated this PR to match the changes in #14020. Can you please review when you have time? Thanks.
LGTM
This value is only meaningful if HasLimit is true, right? Should we add a Debug.Assert(HasLimit) to the getter?
How expensive would it be to have an outerloop test that overflowed to test this condition? Probably prohibitive?
Well, it would be ~2 billion × 3 virtual method calls (MoveNext, Current, and MoveNext on the iterator), plus all of the other logic, so yes, probably too expensive for a test.
Yeah. In a few places in our tests we do use reflection to put the object into a state that would make it hit such cases quickly, e.g.
https://github.com/dotnet/corefx/blob/master/src/System.Runtime.InteropServices/tests/System/Runtime/InteropServices/HandleCollectorTests.cs#L63
Would it be worth doing something like that here? Or we don't think there's enough risk to warrant that kind of fragility?
Would it be worth doing something like that here? Or we don't think there's enough risk to warrant that kind of fragility?
It is low risk, but at the same time I think it is unlikely the field name of _state will change. I will add a test.
I see, this is why you don't assert HasLimit in Limit's getter.
Nit: looking at a code coverage report, this is never hit. Missing a test?
@stephentoub Made the change. Something to note: While adding coverage for this, I also realized I had forgotten to override Select in this type. (Select checks for an Iterator first, then an IPartition, and the default implementation of Iterator.Select is to return a SelectEnumerableIterator rather than a SelectIPartitionIterator. Thus it's necessary to override Select here to return the latter for better perf.) So thank you for pointing that out.
Thanks, @jamesqo. LGTM.
I added a new EnumerablePartition&lt;TSource&gt; class that lets us recognize patterns like Take.Skip, or vice versa, for lazy sequences. We also remove a layer of indirection from some common operations following these methods, e.g. Take.ToArray, Skip.ToArray, etc. Perf results can be found here; test code here. There is a ~20% speedup that results from removing an indirection layer from the chain.
cc @JonHanna, @VSadov, @stephentoub
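As a rough illustration of the pattern recognition described above (my own Python sketch; the actual implementation works on _minIndexInclusive/_maxIndexInclusive fields in C#), chained Skip/Take calls can be folded into a single index window rather than stacking one iterator per call:

```python
def compose_skip(min_incl, max_incl, n):
    # Applying a further Skip(n) moves the lower bound of the window up.
    return min_incl + n, max_incl

def compose_take(min_incl, max_incl, n):
    # Applying a further Take(n) caps the window at n elements.
    return min_incl, min(max_incl, min_incl + n - 1)

NO_LIMIT = 2**31 - 1  # stand-in for "no Take called yet"

# e.Skip(2).Take(5) folds to the index window [2, 6]:
window = compose_take(*compose_skip(0, NO_LIMIT, 2), 5)
assert window == (2, 6)

# ...and a further .Skip(1) folds to [3, 6], still one iterator deep:
assert compose_skip(*window, 1) == (3, 6)
```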