gh-74902: Add Unicode Grapheme Cluster Break algorithm #143076
Add the unicodedata.iter_graphemes() function to iterate over grapheme clusters according to rules defined in Unicode Standard Annex #29. Add unicodedata.grapheme_cluster_break(), unicodedata.indic_conjunct_break() and unicodedata.extended_pictographic() functions to get the properties of the character which are related to the above algorithm.

Co-authored-by: Guillaume "Vermeille" Sanchez <guillaume.v.sanchez@gmail.com>
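For illustration, here is a minimal sketch of how the new iterator might be used. It relies only on what the documentation in this PR states: each emitted item converts to the cluster substring with `str()` and exposes `start` and `end` attributes. The helper name `grapheme_spans` is hypothetical, and the fallback branch is an assumption added so the snippet also runs on interpreters that do not have `iter_graphemes()` yet; that fallback yields one code point per item and does not implement the UAX #29 rules.

```python
import unicodedata


def grapheme_spans(text):
    """Return (start, end, cluster) triples covering ``text``.

    Hypothetical helper: uses unicodedata.iter_graphemes() when the running
    interpreter provides it. Otherwise it falls back to one code point per
    item, which is NOT grapheme segmentation, only a placeholder.
    """
    iter_graphemes = getattr(unicodedata, "iter_graphemes", None)
    if iter_graphemes is None:
        # Assumption: older Python without the function added by this PR.
        return [(i, i + 1, ch) for i, ch in enumerate(text)]
    # str(item) is the cluster text; item.start and item.end are its bounds.
    return [(seg.start, seg.end, str(seg)) for seg in iter_graphemes(text)]


# "e" followed by COMBINING ACUTE ACCENT, then "a".
sample = "e\u0301a"
spans = grapheme_spans(sample)
for start, end, cluster in spans:
    print(start, end, ascii(cluster))
```

In either branch the spans are contiguous and slicing the original string with `start` and `end` gives back the cluster text, so callers can rely on offsets instead of copying substrings.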
| <https://www.unicode.org/reports/tr29/>`_. | ||
| Add :func:`~unicodedata.grapheme_cluster_break`, | ||
| :func:`~unicodedata.indic_conjunct_break` and | ||
| :func:`~unicodedata.extended_pictographic` functions to get the properties | ||
| of the character which are related to the above algorithm. |
|
|
||
| Converting an emitted item to string returns a substring corresponding to | ||
| the grapheme cluster. | ||
| It's ``start`` and ``end`` attributes denote the start and the end of |
| It's ``start`` and ``end`` attributes denote the start and the end of | |
| Its ``start`` and ``end`` attributes denote the start and the end of |
| ``False`` otherwise. | ||
|
|
||
| .. versionadded:: next | ||
|
|
The order of functions in this file doesn’t seem to be alphabetical or topical.
I think another ticket should be created to add a quick links table at the top.
| .. data:: ucd_3_2_0 | ||
|
|
||
| This is an object that has the same methods as the entire module, but uses the | ||
| This is an object that has most of the methods of the entire module, but uses the |
This sentence is not fully right, but I can’t find the right suggestion with both «most of» and «same as».
These functions help compute width?

At least two implementations (in Perl's Unicode::GCString and built into C++) use graphemes. Naive implementation in C's