BUG: inconsistent handling of exact=False case in to_datetime parsing #50435

MarcoGorelli · 2022-12-25T11:06:42Z

closes BUG: inconsistent handling of exact=False case in to_datetime parsing #50412
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Haven't added a whatsnew note, as exact never worked to begin with for ISO8601 formats, and this just corrects #49333

pandas/_libs/tslibs/np_datetime.pxd

WillAyd · 2022-12-27T21:37:14Z

.pre-commit-config.yaml

            '--headers=h',
            --recursive,
-            '--filter=-readability/casting,-runtime/int,-build/include_subdir'
+            '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'


I think this is a good check to keep in place - otherwise these functions get unwieldy

Unfortunately the function is now 522 lines long, whereas the limit for this check is 500

Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?

Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.

Ideally you could change numpy upstream to split the function (maybe split into date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream

OK I'll see if I can upstream something, thanks!

WillAyd · 2022-12-27T21:38:06Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

    while (sublen > 0 && isspace(*substr)) {
        ++substr;
        --sublen;
+        if (exact == PARTIAL_MATCH && !format_len) {


Can we not just make compare_format return a set of Enum depending on what is left in the string to consume and what the matching semantics are? Seems like it would naturally fit there rather than a separate branch every time

To clarify, I think you can return an enum from compare_format with values like OK_EXACT, OK_PARTIAL, etc., describing the different states, then branch in the caller appropriately

WillAyd · 2022-12-27T21:39:08Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.h

+ *      * NO_MATCH: don't require any match - parse without comparing
+ *                  with 'format'.
+ */
+enum Exact {


FYI this file is vendored from numpy. The ship has sailed a bit in terms of editing directly, but when we move to Meson and abandon setuptools it's worth considering a split to put all of our custom logic into a separate library and leaving the vendored code in place (or upstreaming changes if they make sense for numpy)

MarcoGorelli

thanks for your review!

MarcoGorelli · 2022-12-28T10:57:16Z

.pre-commit-config.yaml

            '--headers=h',
            --recursive,
-            '--filter=-readability/casting,-runtime/int,-build/include_subdir'
+            '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'


Unfortunately the function is now 522 lines long, whereas the limit for this check is 500

Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?

WillAyd · 2022-12-28T20:21:25Z

.pre-commit-config.yaml

            '--headers=h',
            --recursive,
-            '--filter=-readability/casting,-runtime/int,-build/include_subdir'
+            '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'


Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.

Ideally you could change numpy upstream to split the function (maybe split into date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream

WillAyd · 2022-12-28T20:22:30Z

pandas/_libs/tslibs/np_datetime.pxd

 ) except? -1
+
+cdef extern from "src/datetime/np_datetime_strings.h":
+    cdef enum Exact:


I think the name Exact is a little too vague - maybe better as DatetimeFormatRequirement?

Yes good call, I've gone with FormatRequirement to keep lines not-too-long

WillAyd · 2022-12-28T20:23:14Z

pandas/_libs/tslibs/np_datetime.pxd

+    cdef enum Exact:
+        PARTIAL_MATCH
+        EXACT_MATCH
+        NO_MATCH


Does NoMatch really mean that the format is inferred?

You're right, I've renamed to INFER_FORMAT, thanks!

WillAyd · 2022-12-28T20:26:45Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

 * Returns 0 on success, -1 on failure.
 */

+enum Comparison {


Design wise this assumes that the callee knows what the caller is doing and can instruct it on actions to take. I think it would be better to separate those entities and just have the callee report back what it knows.

With that in mind, maybe rename this to DatetimePartParseResult and maybe have values of PARTIAL_MATCH, EXACT_MATCH, NO_MATCH. The caller can then choose to take action independent of this function

As in, to name the values the same way as those from FormatRequirement?

The issue is that different format requirements can result in the same result from this function - for example, both EXACT_MATCH where the format matches and INFER_FORMAT can return 0

I've renamed the values to COMPARISON_SUCCESS, COMPLETED_PARTIAL_MATCH, and COMPARISON_ERROR. Is that clearer?
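For readers following the design discussion, here is a minimal sketch of a callee that only reports what it found and leaves the decision to the caller. The enum and value names are the ones mentioned in this thread; the function `compare_format_sketch`, its body, and the helper `next_action_sketch` are hypothetical simplifications for illustration and should not be read as the code that was merged.

```c
#include <string.h>

/* What the caller asks for (names as discussed in the thread). */
typedef enum {
  PARTIAL_MATCH, /* 'format' may run out before the string does */
  EXACT_MATCH,   /* 'format' must account for the whole string */
  INFER_FORMAT   /* do not compare against 'format' at all */
} FormatRequirement;

/* What the callee reports back. */
typedef enum {
  COMPARISON_SUCCESS,
  COMPLETED_PARTIAL_MATCH,
  COMPARISON_ERROR
} DatetimePartParseResult;

/* Hypothetical, simplified comparison: check that the next n characters
 * of *format equal 'expected', and advance past them on success. */
static DatetimePartParseResult compare_format_sketch(
    const char **format, int *characters_remaining,
    const char *expected, int n, FormatRequirement requirement) {
  if (requirement == INFER_FORMAT) {
    return COMPARISON_SUCCESS; /* nothing to compare */
  }
  if (*characters_remaining < 0) {
    return COMPARISON_ERROR; /* should never happen; fail loudly */
  }
  if (requirement == PARTIAL_MATCH && *characters_remaining == 0) {
    return COMPLETED_PARTIAL_MATCH; /* format used up: caller may stop */
  }
  if (*characters_remaining < n || strncmp(*format, expected, n) != 0) {
    return COMPARISON_ERROR;
  }
  *format += n;
  *characters_remaining -= n;
  return COMPARISON_SUCCESS;
}

/* Hypothetical caller-side policy: 1 = keep parsing, 0 = finish early,
 * -1 = report a parse error. */
static int next_action_sketch(DatetimePartParseResult result) {
  switch (result) {
    case COMPARISON_SUCCESS:
      return 1;
    case COMPLETED_PARTIAL_MATCH:
      return 0;
    default:
      return -1;
  }
}
```

In this sketch both EXACT_MATCH with a matching format and INFER_FORMAT report COMPARISON_SUCCESS, which is the reason the result values cannot simply mirror the FormatRequirement values.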

WillAyd · 2022-12-28T20:27:53Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

+        int n,
+        const enum Exact exact
+) {
+  if (exact == PARTIAL_MATCH && !*characters_remaining) {


Suggested change

-  if (exact == PARTIAL_MATCH && !*characters_remaining) {
+  if (exact == PARTIAL_MATCH && *characters_remaining == 0) {

Nit but would be good to explicitly compare to 0. Depending on code structure we may also want to be careful what happens if characters_remaining somehow ends up as negative
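A small sketch of the point being made, using hypothetical helper names that do not appear in the PR: the implicit and explicit tests give the same answer for every int, and neither of them catches a negative count, so a negative value would need its own check.

```c
/* Hypothetical helpers, named for illustration only. */

/* Implicit truthiness test, as in the original line. */
static int exhausted_implicit(int characters_remaining) {
  return !characters_remaining;
}

/* Explicit comparison, as in the suggested change. */
static int exhausted_explicit(int characters_remaining) {
  return characters_remaining == 0;
}

/* Separate guard for a count that should never be negative. */
static int count_is_invalid(int characters_remaining) {
  return characters_remaining < 0;
}
```

Since the two spellings behave identically, the suggested change is about readability; handling a negative count is a separate decision.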

WillAyd

looks pretty good. minor nits on typedefs otherwise lgtm

WillAyd · 2022-12-29T20:16:35Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

 * Returns 0 on success, -1 on failure.
 */

+enum DatetimePartParseResult {


If you use typedef here you don't need to repeat enum every time you refer to this type

nice, thanks!
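As a quick illustration of the typedef suggestion, the sketch below contrasts the two declaration styles. All type, value, and function names here are hypothetical and chosen only to show the syntax.

```c
/* Without typedef: the name is only an enum tag, so every use of the
 * type has to be written as 'enum TaggedResult'. */
enum TaggedResult { TAGGED_OK, TAGGED_ERROR };

static enum TaggedResult check_tagged(int ok) {
  return ok ? TAGGED_OK : TAGGED_ERROR;
}

/* With typedef: the alias can be used like any other type name. */
typedef enum { PLAIN_OK, PLAIN_ERROR } PlainResult;

static PlainResult check_plain(int ok) {
  return ok ? PLAIN_OK : PLAIN_ERROR;
}
```

The two functions behave the same; the typedef only shortens signatures and declarations that mention the type.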

WillAyd · 2022-12-29T20:18:32Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.h

+ *           be able to parse it without error is '%Y-%m-%d';
+ *      * INFER_FORMAT: parse without comparing 'format' (i.e. infer it).
+ */
+enum FormatRequirement {


should typedef here as well

WillAyd · 2022-12-29T21:33:46Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

+        int n,
+        const FormatRequirement format_requirement
+) {
+  if (format_requirement == PARTIAL_MATCH && !*characters_remaining) {


also a nit but I think we need to handle characters_remaining being negative. It could just simply return a COMPARISON_ERROR right?

Understood it is impossible in the current state of things. However, if this gets refactored in the future and a negative number makes its way in here uncaught I think it would return a COMPARISON_SUCCESS and be very difficult to troubleshoot without intimate knowledge of this function

yup, thanks for this (and other) thoughtful comments!

also a nit but I think we need to handle characters_remaining being 0

I presume you meant "less than 0" - that's what I've gone for anyway

WillAyd

lgtm on green

MarcoGorelli force-pushed the exact-inconsistencies branch 2 times, most recently from b366e31 to ee7f95e Compare December 25, 2022 11:09

MarcoGorelli marked this pull request as ready for review December 25, 2022 12:15

fixup

3de2331

MarcoGorelli added Bug Datetime Datetime data dtype labels Dec 25, 2022

MarcoGorelli commented Dec 25, 2022

MarcoGorelli force-pushed the exact-inconsistencies branch from b0cd67c to 3de2331 Compare December 25, 2022 17:04

MarcoGorelli requested review from WillAyd and mroeschke December 26, 2022 15:15

mroeschke approved these changes Dec 27, 2022

mroeschke added this to the 2.0 milestone Dec 27, 2022

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

947353d

WillAyd reviewed Dec 27, 2022

MarcoGorelli added 2 commits December 28, 2022 10:51

use enum

e3fe55b

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

9e18d33

MarcoGorelli commented Dec 28, 2022

WillAyd requested changes Dec 28, 2022

MarcoGorelli added 9 commits December 29, 2022 09:52

more descriptive names

efeaf7a

renaming fixup

6c51924

cast

b96158f

clean up

0ebff5c

doc

caa9c90

correct syntax

84eeb3d

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

f92ff7a

use typedef

5c67ed3

check for negative characters remaining

bad704e

WillAyd reviewed Dec 29, 2022

WillAyd requested changes Dec 29, 2022

WillAyd approved these changes Dec 29, 2022

MarcoGorelli added 2 commits December 30, 2022 14:43

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

ec6591b

reduce diff

8d8f90e

MarcoGorelli merged commit a28cadb into pandas-dev:main Dec 31, 2022
