Handle unicode text #131

feynmanliang · 2016-07-19T14:48:03Z

Using theanets.recurrent.Text on a UTF-8 encoded corpus used to raise an error:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/home/fl350/bachbot/scripts/theanet/theanet.py in <module>()
     24 with codecs.open(path, 'r', 'utf-8') as handle:
     25     file_data = handle.read().lower()
---> 26     text = theanets.recurrent.Text(file_data[:int(VAL_FRACTION*len(file_data))])
     27     text_val = theanets.recurrent.Text(file_data[int(VAL_FRACTION*len(file_data)):])
     28

/home/fl350/theanets/theanets/recurrent.py in __init__(self, text, alpha, min_count, unknown)
     89                 collections.Counter(text).items()
     90                 if char != unknown and count >= min_count)))
---> 91         print type(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'))
     92         self.text = re.sub(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'), unknown, text)
     93         assert unknown not in self.alpha

UnicodeEncodeError: 'ascii' codec can't encode character u'\x83' in position 85: ordinal not in range(128)

This is fixed by this PR.
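For context, here is a minimal sketch of the failure mode. It assumes the cause is the implicit ASCII encode that Python 2 performs when a byte-string template is formatted with a unicode argument. The explicit `.encode('ascii')` call stands in for that implicit step, and `alpha` is a hypothetical alphabet, not the one computed in `recurrent.py`.

```python
# Hypothetical alphabet containing one non-ASCII character (U+0083),
# the same character reported in the traceback above.
alpha = u'ab\x83'

# Stand-in for the implicit ASCII encode: this is the step that fails.
try:
    alpha.encode('ascii')
    failed = False
    message = ''
except UnicodeEncodeError as err:
    failed = True
    message = str(err)

# Keeping both the template and the argument as unicode text
# avoids any encode step while building the pattern.
pattern = u'[^{}]'.format(alpha)
```

The point of the sketch is only that the pattern can be built without leaving unicode, so no codec is involved.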

coveralls · 2016-07-19T15:37:29Z

Coverage decreased (-0.1%) to 94.768% when pulling eaca433 on feynmanliang:text-handle-utf into b637b01 on lmjohns3:master.

lmjohns3 · 2016-07-20T02:34:48Z

This can get pretty tricky with text encodings. My preference is to always operate with unicode, because then iterating over a string is guaranteed to iterate over a "letter" instead of iterating over parts of multi-byte characters. That said, I haven't been very careful about enforcing this!
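A small sketch of the difference described above, using a hypothetical sample string: iterating over the unicode object yields whole characters, while the UTF-8 encoded form is longer because the accented letter occupies two bytes.

```python
# Hypothetical sample: four characters, the last one non-ASCII.
text = u'caf\u00e9'

# Iterating over unicode text yields one item per character.
letters = list(text)

# The UTF-8 encoding of the same text uses two bytes for the last
# character, so byte-wise iteration would split it in half.
encoded = text.encode('utf-8')
byte_count = len(encoded)
```

Here `letters` has four entries and `encoded` has five bytes, which is why byte-wise iteration cannot be treated as letter-wise iteration.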

This is additionally complicated by the fact that Py2 and Py3 have different defaults for handling strings. I personally use Py3 but I try to test everything with Py2 as well (see the Travis config).

Which version of Python are you using? Can you try using a "unicode" object instead of a UTF-8 encoded byte sequence to see if this problem persists? Can you add a test that runs a unicode object through the recurrent infrastructure and add it to this PR? Also, this PR breaks an existing test; please fix it.

feynmanliang · 2016-07-20T11:00:07Z

Thanks for taking a look. I will push some changes soon to address the issues.

feynmanliang · 2016-07-20T13:42:44Z

I'm using Python 2.7.3.
I can reproduce the error with the following code (assuming path points to a file of UTF-8 encoded text):

import codecs
import theanets

# path and TRAIN_FRACTION are assumed to be defined elsewhere
with codecs.open(path, 'r', 'utf-8') as handle:
    file_data = handle.read().lower()
    text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
    text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])

Or, using a unicode object:

with open(path, 'r') as handle:
    file_data = unicode(handle.read(), 'utf-8').lower()
    text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
    text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])
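The alphabet and substitution steps visible in the traceback can be sketched without theanets, which may help when writing a test. This is only an approximation under stated assumptions: `build_alphabet` and `replace_unknown` are hypothetical helper names, and the `min_count=2` and `unknown=u'\0'` defaults are chosen for illustration rather than taken from the library.

```python
import collections
import re


def build_alphabet(text, min_count=2, unknown=u'\0'):
    """Collect characters that occur at least min_count times."""
    counts = collections.Counter(text)
    return u''.join(sorted(
        char for char, count in counts.items()
        if char != unknown and count >= min_count))


def replace_unknown(text, alpha, unknown=u'\0'):
    """Replace every character outside alpha with the unknown marker."""
    # The pattern stays unicode text throughout; nothing is encoded.
    pattern = u'[^{}]'.format(re.escape(alpha))
    return re.sub(pattern, unknown, text)


# Hypothetical sample: 'z' occurs once, so it falls outside the alphabet.
sample = u'h\u00e9llo h\u00e9llo z'
alpha = build_alphabet(sample)
cleaned = replace_unknown(sample, alpha)
```

With this sample the non-ASCII character is kept in the alphabet and only the rare `z` is replaced, which is the behaviour a unicode regression test would want to pin down.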

Handle unicode text

eaca433

feynmanliang added 2 commits July 19, 2016 16:43

Fix for 3.4

d107a0b

Fixes utf8

27c016d

Use unicode object

22360a2
