-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
BUG: inconsistent handling of exact=False case in to_datetime parsing #50435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: inconsistent handling of exact=False case in to_datetime parsing #50435
Conversation
b366e31 to
ee7f95e
Compare
b0cd67c to
3de2331
Compare
| '--headers=h', | ||
| --recursive, | ||
| '--filter=-readability/casting,-runtime/int,-build/include_subdir' | ||
| '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good check to keep in place - otherwise these functions get unwieldy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the function is now 522 lines long, whereas the limit for this check is 500
Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.
Ideally you could change numpy upstream to split the function (maybe split into a date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I'll see if I can upstream something, thanks!
| while (sublen > 0 && isspace(*substr)) { | ||
| ++substr; | ||
| --sublen; | ||
| if (exact == PARTIAL_MATCH && !format_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not just make compare_format return a set of Enum depending on what is left in the string to consume and what the matching semantics are? Seems like it would naturally fit there rather than a separate branch every time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, I think you can return an enum from check_format of values like:
OK_EXACT
OK_PARTIALetc... describing the different states, then branch in the caller appropriately
| * * NO_MATCH: don't require any match - parse without comparing | ||
| * with 'format'. | ||
| */ | ||
| enum Exact { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this file is vendored from numpy. The ship has sailed a bit in terms of editing directly, but when we move to Meson and abandon setuptools its worth considering a split to put all of our custom logic into a separate library and leaving the vendored code in place (or upstreaming changes if they make sense for numpy)
MarcoGorelli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your review!
| '--headers=h', | ||
| --recursive, | ||
| '--filter=-readability/casting,-runtime/int,-build/include_subdir' | ||
| '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the function is now 522 lines long, whereas the limit for this check is 500
Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?
| '--headers=h', | ||
| --recursive, | ||
| '--filter=-readability/casting,-runtime/int,-build/include_subdir' | ||
| '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.
Ideally you could change numpy upstream to split the function (maybe split into a date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream
pandas/_libs/tslibs/np_datetime.pxd
Outdated
| ) except? -1 | ||
|
|
||
| cdef extern from "src/datetime/np_datetime_strings.h": | ||
| cdef enum Exact: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name Exact is a little too vague - maybe better as DatetimeFormatRequirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good call, I've gone with FormatRequirement to keep lines not-too-long
pandas/_libs/tslibs/np_datetime.pxd
Outdated
| cdef enum Exact: | ||
| PARTIAL_MATCH | ||
| EXACT_MATCH | ||
| NO_MATCH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does NoMatch really mean that the format is inferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I've renamed to INFER_FORMAT, thanks!
| * Returns 0 on success, -1 on failure. | ||
| */ | ||
|
|
||
| enum Comparison { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Design wise this assumes that the callee knows what the caller is doing and can instruct it on actions to take. I think it would be better to separate those entities and just have the callee report back what it knows.
With that in mind, maybe call rename this to DatetimePartParseResult and maybe have values of PARTIAL_MATCH, EXACT_MATCH, NO_MATCH. The caller can then choose to take action independent of this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, to name the values the same way as those from FormatRequirement?
The issue is that different format requirements can result in the same result from this function - for example, both EXACT_MATCH where the format matches and INFER_FORMAT can return 0
I've renamed the values to
COMPARISON_SUCCESS,
COMPLETED_PARTIAL_MATCH,
COMPARISON_ERROR
, is that clearer?
| int n, | ||
| const enum Exact exact | ||
| ) { | ||
| if (exact == PARTIAL_MATCH && !*characters_remaining) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (exact == PARTIAL_MATCH && !*characters_remaining) { | |
| if (exact == PARTIAL_MATCH && *characters_remaining == 0) { |
Nit but would be good to explicitly compare to 0. Depending on code structure we may also want to be careful what happens if characters_remaining somehow ends up as negative
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. minor nits on typedefs otherwise lgtm
| * Returns 0 on success, -1 on failure. | ||
| */ | ||
|
|
||
| enum DatetimePartParseResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use typedef here you don't need to repeat enum every time you refer to this type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks!
| * be able to parse it without error is '%Y-%m-%d'; | ||
| * * INFER_FORMAT: parse without comparing 'format' (i.e. infer it). | ||
| */ | ||
| enum FormatRequirement { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should typedef here as well
| int n, | ||
| const FormatRequirement format_requirement | ||
| ) { | ||
| if (format_requirement == PARTIAL_MATCH && !*characters_remaining) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also a nit but I think we need to handle characters_remaining being negative. It could just simply return a COMPARISON_ERROR right?
Understood it is impossible in the current state of things. However, if this gets refactored in the future and a negative number makes its way in here uncaught I think it would return a COMPARISON_SUCCCESS and be very difficult to troubleshoot without intimate knowledge of this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, thanks for this (and other) thoughtful comments!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also a nit but I think we need to handle characters_remaining being 0
I presume you meant "less than 0" - that's what I've gone for anyway
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm on green
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.Haven't added a whatsnew note, as
exactnever worked to begin with for ISO8601 formats, and this just corrects #49333