-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-46589: [C++] Fix utf8_is_digit to support full Unicode digit range #46590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
kou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6fa58c2 to
5cbabcf
Compare
ce442e7 to
47e2683
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the update @iabhi4 ! This LGTM, I'll just wait for CI.
47e2683 to
ce92151
Compare
|
I pushed a quick fix for the C++ lint failures, will merge. |
|
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit dc0f5a9. None of the specified runs were found on the Conbench server. The full Conbench report has more details. |
Rationale for this change
pyarrow.compute.utf8_is_digitdid not recognize valid Unicode digit characters (e.g., superscripts like'³'), diverging from the behavior of Python's built-instr.isdigit()This caused inconsistencies in downstream libraries like pandas when using PyArrow-backed StringDtype.
What changes are included in this PR?
Updated
IsDigitCharacterUnicodeimplementation to cover a broader range of Unicode digits by replacing category check with one that aligns with Python’sstr.isdigit()semantics.Added tests in
scalar_string_test.ccto validate correct digit detection across diverse Unicode digit inputs.Are these changes tested?
Yes. New unit tests were added and pass successfully, verifying behavior on various Unicode digit characters.
Are there any user-facing changes?
Yes, users relying on
pc.utf8_is_digit()will now get correct results for a wider range of Unicode digit characters, improving correctness and parity with Python semanticsutf8_is_digitin PyArrow doesn't fully match Python'sstr.isdigit()(e.g., fails for '³') #46589