@iabhi4 iabhi4 commented May 26, 2025

Rationale for this change

pyarrow.compute.utf8_is_digit did not recognize some valid Unicode digit characters (e.g., superscripts like '³'), diverging from the behavior of Python's built-in str.isdigit().
This caused inconsistencies in downstream libraries such as pandas when using the PyArrow-backed StringDtype.
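
To illustrate the gap, here is a minimal Python sketch that uses only the standard library. It assumes the previous check accepted only characters in the general category Nd (decimal digits); `is_decimal_category` is a hypothetical stand-in for that assumed behavior, not the actual Arrow code.

```python
import unicodedata


def is_decimal_category(ch: str) -> bool:
    """Hypothetical stand-in for a check that accepts only category Nd."""
    return unicodedata.category(ch) == "Nd"


# ASCII digit, Arabic-Indic digit, superscript, circled digit, fraction, letter
samples = ["3", "٣", "³", "①", "½", "a"]

for ch in samples:
    print(
        f"U+{ord(ch):04X} category={unicodedata.category(ch)} "
        f"Nd-only={is_decimal_category(ch)} isdigit={ch.isdigit()}"
    )
```

Characters such as '³' and '①' have category No, so an Nd-only check rejects them, while str.isdigit() accepts them. '½' is rejected by both.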

What changes are included in this PR?

Updated the IsDigitCharacterUnicode implementation to cover a broader range of Unicode digits by replacing the category check with one that aligns with Python’s str.isdigit() semantics.

Added tests in scalar_string_test.cc to validate correct digit detection across diverse Unicode digit inputs.
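
As a rough guide to the kind of inputs such tests can cover, the sketch below models the intended string-level behavior in Python. `utf8_is_digit_like` is a hypothetical helper, and the cases are illustrative rather than copied from scalar_string_test.cc. It assumes the kernel returns true only for non-empty strings made up entirely of digit characters.

```python
def utf8_is_digit_like(s: str) -> bool:
    """Hypothetical model: non-empty and every character passes str.isdigit()."""
    # all() is True for an empty iterable, so the length check is required.
    return len(s) > 0 and all(ch.isdigit() for ch in s)


# Illustrative inputs mapped to the expected result under Python semantics.
cases = {
    "123": True,   # ASCII digits
    "٣٤٥": True,   # Arabic-Indic digits
    "²³": True,    # superscript digits
    "①②": True,    # circled digits
    "": False,     # empty string
    "1½": False,   # a fraction is numeric but not a digit
    "12a": False,  # contains a letter
}

for text, expected in cases.items():
    assert utf8_is_digit_like(text) == expected, repr(text)
```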

Are these changes tested?

Yes. New unit tests were added and pass successfully, verifying behavior on various Unicode digit characters.

Are there any user-facing changes?

Yes. Users relying on pc.utf8_is_digit() will now get correct results for a wider range of Unicode digit characters, improving correctness and parity with Python semantics.

@github-actions
Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@kou kou changed the title from "ARROW-46589: Fixed utf8_is_digit to support full Unicode digit range" to "GH-46589: [C++] Fixed utf8_is_digit to support full Unicode digit range" May 27, 2025
@kou kou left a comment

@pitrou Could you take a look at this?

NOTE: utf8_is_digit() was introduced by #7656.

@iabhi4 iabhi4 force-pushed the fix-utf8-is-digit-46589 branch from 6fa58c2 to 5cbabcf Compare May 27, 2025 01:37
@iabhi4 iabhi4 requested a review from kou May 27, 2025 01:40
@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label May 27, 2025
@iabhi4 iabhi4 force-pushed the fix-utf8-is-digit-46589 branch 2 times, most recently from ce442e7 to 47e2683 Compare May 28, 2025 17:49
@iabhi4 iabhi4 requested a review from pitrou May 28, 2025 17:50
@pitrou pitrou left a comment

Thanks a lot for the update, @iabhi4! This LGTM, I'll just wait for CI.

@pitrou pitrou force-pushed the fix-utf8-is-digit-46589 branch from 47e2683 to ce92151 Compare June 2, 2025 10:22
pitrou commented Jun 2, 2025

I pushed a quick fix for the C++ lint failures, will merge.

@pitrou pitrou changed the title from "GH-46589: [C++] Fixed utf8_is_digit to support full Unicode digit range" to "GH-46589: [C++] Fix utf8_is_digit to support full Unicode digit range" Jun 2, 2025
@pitrou pitrou merged commit dc0f5a9 into apache:main Jun 2, 2025
36 of 38 checks passed
@pitrou pitrou removed the "awaiting committer review" label Jun 2, 2025
@conbench-apache-arrow
After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit dc0f5a9.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.
