-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: Iteration over DatetimeIndex stops at chunksize (GH21012) #21027
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21027 +/- ##
==========================================
- Coverage 91.82% 91.82% -0.01%
==========================================
Files 153 153
Lines 49502 49523 +21
==========================================
+ Hits 45457 45475 +18
- Misses 4045 4048 +3
Continue to review full report at Codecov.
|
|
Couple of comments:
|
pandas/core/indexes/datetimes.py
Outdated
|
|
||
| class Iterator: | ||
| def __init__(self, index): | ||
| self.data = index.asi8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have a BaseIterator class in pandas.io.common, it would need to be moved to pandas.core.common to use it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but why are you creating a class for this anyhow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and I'd prefer to keep this in DatetimeIndex.__iter__ unless it's going to be used elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If DatetimeIndex class has __next__ method and __iter__ returns itself, some other codes see DatetimeIndex as a iterator rather than a sequence. This cause some failures (#21012). So I wanted to hide __next__ somehow.
I found BaseIterator. But with it, it would become complicated because we have to make a subclass or override methods of a instance of BaseIterator (is this correct?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that most of the state tracking can be done inside DatetimeIndex.__iter__. Unless I'm misunderstanding, there's no need or desire to add DatetimeIndex.__next__.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... sorry, but I can't come up with the way using only __iter__ and avoiding yield (#18848).
TomAugspurger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also need a release note (for 0.23.0)
| index = date_range('2000-01-01 00:00:00', periods=periods, freq='min') | ||
| num = 0 | ||
| for i, d in enumerate(index): | ||
| assert index[num] == d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use i instead of num?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to check boundary values including periods == 0.
Without num, it would be a bit complicated, isn't it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine then, though you can remove the enumerate since i isn't used anywhere I think.
pandas/core/indexes/datetimes.py
Outdated
|
|
||
| class Iterator: | ||
| def __init__(self, index): | ||
| self.data = index.asi8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and I'd prefer to keep this in DatetimeIndex.__iter__ unless it's going to be used elsewhere.
|
I'm not sure where to write a whatsnew (this is not a bug in 0.22.0). |
|
Good point on the whatsnew. I suppose it's not necessary in this case. |
|
Yes, whatsnew is not necessary as it was not a bug before. IMO, this PR adds a lot of complexity, but probably necessary if we want to keep both:
I don't know how important either of those are (eg couldn't this be fixed in dill itself? Other pickle libraries have no problem with it?), so difficult to assess for me. |
|
An alternative is the original fix proposed by @kittoku which entails making |
|
So there are 4 options from what I see now. 1 : 2 : 3 : 4 : I'm not in need to pickle by dill, so option 2 is best to me. I thought option 3 was the second. But now option 4 looks not bad because an idea of DatetimeIndex as iterator seems not so strange. |
|
sure this is a bug, but not a blocker for 0.23.0 |
|
It is odd of me to say that, but could we discard my PR and cancel a commit of #18848 (take option 2)? I can't find a reason to pickle |
|
@kittoku that's a good point. We don't even have a test for that , and to be honest most folks use |
|
May I leave the test? |
|
yes test is ok |
|
thanks @kittoku yeah the original (not-released) patched wasn't doc-ced / tested, and unclear how this actually solves a problem, while breaking this. |
git diff upstream/master -u -- "*.py" | flake8 --diffavoided using
yieldand changing DatetimeIndex itself into a iterator. @cbertinato 's advice was helpful.