gh-74902: Add Unicode Grapheme Cluster Break algorithm #143076
Open
serhiy-storchaka wants to merge 4 commits into python:main from serhiy-storchaka:grapheme_cluster_break2
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -184,6 +184,28 @@ following functions: | |
| '0041 0303' | ||
| .. function:: grapheme_cluster_break(chr, /) | ||
| Returns the Grapheme_Cluster_Break property assigned to the character. | ||
| .. versionadded:: next | ||
| .. function:: indic_conjunct_break(chr, /) | ||
| Returns the Indic_Conjunct_Break property assigned to the character. | ||
| .. versionadded:: next | ||
| .. function:: extended_pictographic(chr, /) | ||
| Returns ``True`` if the character has the Extended_Pictographic property, | ||
| ``False`` otherwise. | ||
| .. versionadded:: next | ||
| .. function:: normalize(form, unistr, /) | ||
| Return the normal form *form* for the Unicode string *unistr*. Valid values for | ||
| @@ -225,6 +247,24 @@ following functions: | |
| .. versionadded:: 3.8 | ||
| .. function:: iter_graphemes(unistr, start=0, end=sys.maxsize, /) | ||
| Returns an iterator to iterate over grapheme clusters. | ||
| With optional *start*, iteration begins at that position. | ||
| With optional *end*, iteration stops at that position. | ||
| Converting an emitted item to string returns a substring corresponding to | ||
| the grapheme cluster. | ||
| Its ``start`` and ``end`` attributes denote the start and the end of | ||
| the grapheme cluster. | ||
| It uses extended grapheme cluster rules defined by Unicode | ||
| Standard Annex #29, `"Unicode Text Segmentation" | ||
| <https://www.unicode.org/reports/tr29/>`_. | ||
| .. versionadded:: next | ||
| In addition, the module exposes the following constant: | ||
| .. data:: unidata_version | ||
| @@ -234,7 +274,7 @@ In addition, the module exposes the following constant: | |
| .. data:: ucd_3_2_0 | ||
| - This is an object that has the same methods as the entire module, but uses the | ||
| + This is an object that has most of the methods of the entire module, but uses the | ||
Member
This sentence is not fully right, but I can’t find the right suggestion with both «most of» and «same as».
| Unicode database version 3.2 instead, for applications that require this | ||
| specific version of the Unicode database (such as IDNA). | ||
||
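For context on the ``ucd_3_2_0`` wording discussed above, here is a minimal sketch comparing the current database with the 3.2 snapshot. It relies only on attributes available in released Python versions (``unidata_version`` and ``category``); the sample character U+1F600 is an illustrative choice and does not come from the PR.

```python
import unicodedata

# U+1F600 GRINNING FACE was added to Unicode long after version 3.2
# (sample character chosen for illustration only).
ch = "\U0001F600"

current = unicodedata.category(ch)           # category in the current database
legacy = unicodedata.ucd_3_2_0.category(ch)  # category in the 3.2.0 snapshot

print(unicodedata.ucd_3_2_0.unidata_version)
print(current, legacy)
```

The snapshot reports the character as unassigned, which is the behaviour applications such as IDNA depend on.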
8 changes: 8 additions & 0 deletions
Misc/NEWS.d/next/Library/2025-12-22-18-25-54.gh-issue-74902.HqrWUV.rst
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| Add the :func:`~unicodedata.iter_graphemes` function in the | ||
| :mod:`unicodedata` module to iterate over grapheme clusters according to | ||
| rules defined in `Unicode Standard Annex #29, "Unicode Text Segmentation" | ||
| <https://www.unicode.org/reports/tr29/>`_. Add | ||
| :func:`~unicodedata.grapheme_cluster_break`, | ||
| :func:`~unicodedata.indic_conjunct_break` and | ||
| :func:`~unicodedata.extended_pictographic` functions to get the properties | ||
| of the character which are related to the above algorithm. |
The order of functions in this file doesn’t seem to be alphabetical or topical.
I think another ticket should be created to add a quick links table at the top.