Basic analysis of the script content of strings by jwiggins · Pull Request #764 · enthought/enable

jwiggins · 2021-03-29T15:43:05Z

This is another part of #762

There are two pieces here, and a ton of machine-generated code (don't fear the diff):

A small program which fetches http://www.unicode.org/Public/UNIDATA/Scripts.txt, parses it, and writes out the kiva.fonttools.text._data module.
A new class UnicodeAnalyzer which uses the data from Scripts.txt and, for a given input string, returns its slices together with the script ("language") each slice belongs to.
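
For readers unfamiliar with the Scripts.txt format, the first piece can be sketched roughly as follows. This is a minimal illustration, not the generator script from this PR: the function name `parse_scripts` and the `SAMPLE` excerpt are assumptions made for the example, and it parses an inline string rather than fetching the file.

```python
import re

# Matches "0041..005A ; Latin" as well as single code points like "00AA ; Latin".
_LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(text):
    """Return a sorted list of (start, end, script) tuples (hypothetical helper)."""
    ranges = []
    for line in text.splitlines():
        # Everything after '#' is a comment in Scripts.txt.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        match = _LINE_RE.match(line)
        if match is None:
            continue
        start = int(match.group(1), 16)
        # A line without ".." describes a single code point.
        end = int(match.group(2), 16) if match.group(2) else start
        ranges.append((start, end, match.group(3)))
    ranges.sort()
    return ranges


# A short excerpt written in the Scripts.txt line format, for illustration only.
SAMPLE = """\
# (excerpt)
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0020          ; Common # Zs       SPACE
"""

for start, end, script in parse_scripts(SAMPLE):
    print(f"{start:04X}..{end:04X} {script}")
```

A generator along these lines could then write the sorted ranges and names out as a Python module, which is presumably what ends up in kiva.fonttools.text._data.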

UnicodeAnalyzer is pretty basic right now. I'd like to keep it that way in this PR. For instance, emoji joined with zero-width joiners (U+200D) are not handled well; each emoji and each joiner becomes its own slice:

In [3]: s = "👩‍👩‍👧‍👧" 

In [4]: s                                                                       
Out[4]: '👩\u200d👩\u200d👧\u200d👧'

In [5]: an.languages(s)                                                         
Out[5]: 
[(0, 1, 'Common'),
 (1, 2, 'Inherited'),
 (2, 3, 'Common'),
 (3, 4, 'Inherited'),
 (4, 5, 'Common'),
 (5, 6, 'Inherited'),
 (6, 7, 'Common')]
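
The output above is what a plain "group consecutive code points by script" pass produces, because U+200D is listed as Inherited in Scripts.txt. The sketch below reproduces that behaviour and shows one possible refinement, in which an Inherited character takes the script of the character before it. Everything here is hypothetical: `toy_script` is a stand-in for a real table lookup, and `script_runs` with its `merge_inherited` flag is not part of UnicodeAnalyzer.

```python
from itertools import groupby


def toy_script(ch):
    """Hypothetical stand-in for a Scripts.txt lookup; covers only this demo."""
    if ord(ch) in (0x200C, 0x200D):  # ZWNJ and ZWJ are Inherited
        return "Inherited"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Common"


def script_runs(s, script_of=toy_script, merge_inherited=False):
    """Return (start, end, script) runs for the string s."""
    scripts = []
    for ch in s:
        script = script_of(ch)
        # Optional refinement: Inherited follows the preceding character.
        if merge_inherited and script == "Inherited" and scripts:
            script = scripts[-1]
        scripts.append(script)

    runs, pos = [], 0
    for script, group in groupby(scripts):
        length = len(list(group))
        runs.append((pos, pos + length, script))
        pos += length
    return runs


family = "\U0001F469\u200d\U0001F469\u200d\U0001F467\u200d\U0001F467"
print(script_runs(family))                        # seven alternating runs
print(script_runs(family, merge_inherited=True))  # a single Common run
```

Whether merging Inherited runs is the right behaviour for text layout is a separate design question that this PR deliberately leaves open.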

jwiggins · 2021-03-29T15:44:15Z

+    94: "Tai_Le",
+    95: "New_Tai_Lue",


These were changed to match the names in kiva.fonttools.text._data.

rahulporuri

LGTM

rahulporuri · 2021-03-30T08:32:22Z

+
+    def _lookup_codepoint(self, cp):
+        comps = self.ranges - ord(cp)
+        index = ((comps[:, 0] <= 0) == (comps[:, 1] >= 0)).argmax()


This looks like the most important detail in the PR - how we're selecting the entry given a code point - and it'd be useful if you could elaborate on how we're doing it.

Fair enough
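
As a reading aid for the two quoted lines: subtracting the code point from every [start, end] row gives a non-positive first column where start <= cp and a non-negative second column where end >= cp, so the mask is true only for the row that contains the code point, and argmax picks the first such row. The sketch below mirrors that logic in plain Python and adds a bisect-based variant for comparison. The `RANGES` and `NAMES` tables are hand-written for illustration, not copied from the generated module, and both function names are hypothetical. One thing worth noting: numpy's argmax returns 0 when the mask is all False, and the quoted lines do not show how a miss is handled, so the sketch returns None explicitly.

```python
from bisect import bisect_right

# Illustrative, hand-written table: sorted, non-overlapping [start, end] ranges.
RANGES = [(0x0020, 0x0020), (0x0041, 0x005A), (0x200C, 0x200D), (0x1F300, 0x1F5FF)]
NAMES = ["Common", "Latin", "Inherited", "Common"]


def lookup_linear(ch):
    """Pure-Python mirror of the mask used in the diff (hypothetical helper)."""
    cp = ord(ch)
    comps = [(start - cp, end - cp) for start, end in RANGES]
    # Since start <= end, both tests can only agree when both are True,
    # i.e. when start <= cp <= end.
    mask = [(lo <= 0) == (hi >= 0) for lo, hi in comps]
    return NAMES[mask.index(True)] if True in mask else None


def lookup_bisect(ch):
    """Alternative technique: binary search on the range starts."""
    cp = ord(ch)
    starts = [start for start, _ in RANGES]
    i = bisect_right(starts, cp) - 1
    if i >= 0 and RANGES[i][1] >= cp:
        return NAMES[i]
    return None


for ch in ("A", "\u200d", "\U0001F469", "\u0100"):
    print(f"U+{ord(ch):04X}", lookup_linear(ch), lookup_bisect(ch))
```

The vectorised comparison scans every range per lookup, while the bisect variant needs sorted ranges but only does a logarithmic number of comparisons; which trade-off suits the analyzer is not something the quoted diff settles.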

rahulporuri · 2021-03-30T09:04:51Z

still LGTM

jwiggins · 2021-03-30T09:09:19Z

Thanks for the review

Add a parser for the Unicode Scripts.txt file

c925724

rahulporuri approved these changes Mar 30, 2021

PR feedback

94fcdf1

jwiggins merged commit ffd2535 into master Mar 30, 2021

jwiggins deleted the feature/language-parse branch March 30, 2021 09:09
