-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: Avoid flaky usecols set in C engine #14984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Current coverage is 84.78% (diff: 100%)@@ master #14984 diff @@
==========================================
Files 145 145
Lines 51090 51094 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43315 43319 +4
Misses 7775 7775
Partials 0 0
|
|
|
||
| return set(usecols) | ||
| return usecols | ||
| return set(usecols), usecols_dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this still have a set conversion here? is this for a different case? if so, then should happen in your validation below (to make it more consistent), IOW, this always returns tuple of (list of usecols, dtype of usecols). Please make the doc-string indicate that as well (t he doc-string for this function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I fully follow your comment here. The reason I return set is for the Python engine. I can clarify the doc-string to return set (or None) along the dtype of usecols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more clear to have _validate_use_cols always return (list, dtype) (for all engines). It makes it very clear what it does. Then have it convert to a set elsewhere (as you do in _set_noconvert_columns. otherwise you have 2 places which do this conversion (from list -> set), depending on the engine. again confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's always set for both engines. This function returns either (independent of engine):
setNonecallable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : I updated the documentation to illustrate the return type is independent of engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still, why are we returning a set here? why not a list? (for this same case)? then you don't have to rely on the ordering and convert to a list if its integer type?
could just do this: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order
then make it into a set at the last moment.
I am nitpicking on this because converting this to a set is happening too early, and it is just confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early conversion to set has been used for quite some time, especially with the C engine. Also, a list is not necessarily going to help with ordering I might add:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b,c\n1,2,3'
>>> read_csv(StringIO(data), usecols=[2, 1])
b c
0 2 3When I say "order," it is not about maintaining the order in which usecols is passed in but about ordering usecols correctly so that the names or indices are in the same order as they are listed in the data columns. That is why converting at the last moment does not present any benefit AFAICT.
30bae02 to
6ac5814
Compare
| if self.usecols_dtype == 'integer': | ||
| # A set of integers will be converted to a list in | ||
| # the correct order every single time. | ||
| usecols = list(self.usecols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is very confusing here. You are now converting to a list. So usecols ends up as a list, None or a callable? when / how does it end up as a set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a short docstring for this _set_noconvert_columns function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : During __init__, both functions call _validate_usecols_arg when setting self.usecols, which returns either a callable, set, or None.
@jorisvandenbossche : Added docstring.
6ac5814 to
82cf55b
Compare
|
@jreback , @jorisvandenbossche : Addressed comments, and everything is green. |
|
ok thanks @gfyoung |
Explanation of the bug can be found here.
Closes #14792.