Add support for decoding CESU-8 encoded strings. #17

nlitsme · 2017-06-29T13:33:12Z

This works around Java's broken UTF-8 implementation.

You will need the https://github.com/LuminosoInsight/python-ftfy module for the patch to have an effect.

The following code will now output 😃 (U+1F603, i.e. \U0001F603) instead of raising a UnicodeDecodeError or printing ??????.

from __future__ import division, print_function
from binascii import a2b_hex
import javaobj

b = a2b_hex("ACED0005740006EDA0BDEDB883")
print(javaobj.loads(b))

The problem with the byte sequence ED A0 BD ED B8 83 is that a plain UTF-8 decode yields D83D DE03, which are surrogate code points and therefore invalid on their own. Together, however, they form a valid UTF-16 surrogate pair, so you have to decode twice: first as UTF-8, then as UTF-16. The result is the Unicode character U+1F603.
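To make the two-step decode concrete, here is a minimal sketch using only the Python 3 standard library. It is not the code from this patch (which relies on ftfy); the helper name `decode_cesu8_pair` is hypothetical, and the sketch assumes the input is well-formed CESU-8.

```python
def decode_cesu8_pair(raw):
    """Illustrative helper (not part of javaobj or ftfy)."""
    # Step 1: decode as UTF-8, letting lone surrogates through.
    surrogates = raw.decode("utf-8", "surrogatepass")
    # Step 2: re-encode the surrogates as UTF-16 code units and decode
    # them again, so that each high/low pair fuses into one character.
    utf16 = surrogates.encode("utf-16-le", "surrogatepass")
    return surrogates, utf16.decode("utf-16-le")

raw = bytes.fromhex("EDA0BDEDB883")
surrogates, text = decode_cesu8_pair(raw)

# The same result by hand, using the UTF-16 surrogate-pair formula.
high, low = (ord(c) for c in surrogates)
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(ord(c)) for c in surrogates])  # ['0xd83d', '0xde03']
print(hex(code_point))                    # 0x1f603
```

The `surrogatepass` error handler is what lets Python's otherwise strict UTF-8 codec accept the surrogate code points in step 1.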


tcalmant · 2017-06-29T13:41:07Z

Thanks for your contribution!
I'll add a word about the ftfy package in the README.

voetsjoeba · 2017-07-13T21:32:47Z

Note that the CESU-8/Java-UTF-8 decoder in ftfy.bad_codecs does not enforce correctness, and is documented as being explicitly intended not to do so.

Here's an example of a byte sequence that is invalid CESU-8 and is rejected by Java, but is accepted by ftfy's decoder:

import ftfy.bad_codecs
print(b'\xf0\x90\x80\x80'.decode("java_utf8", errors="strict") == u"\U00010000") # True

So be careful not to rely on the codec to make accept/reject decisions about the validity of serialized objects ...
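If a caller does want to reject input like the example above before handing it to a lenient decoder, one option is a separate pre-check. The sketch below is an assumption-laden illustration, not part of javaobj or ftfy: the function name `has_four_byte_lead` is hypothetical, and it covers only one rule (Java's modified UTF-8 and CESU-8 never use 4-byte sequences, so no byte in the range 0xF0-0xFF can appear). It is not a full validator.

```python
def has_four_byte_lead(data):
    """Return True if data contains a byte that can only start a
    4-byte UTF-8 sequence (0xF0 and above).

    Illustrative helper: such bytes never occur in CESU-8 or in
    Java's modified UTF-8, which encode supplementary characters
    as two 3-byte surrogate sequences instead.
    """
    # bytearray() yields integers on both Python 2 and Python 3.
    return any(byte >= 0xF0 for byte in bytearray(data))

standard_utf8 = b"\xf0\x90\x80\x80"           # U+10000 as standard UTF-8
cesu8 = bytearray.fromhex("EDA0BDEDB883")     # U+1F603 as CESU-8

print(has_four_byte_lead(standard_utf8))  # True
print(has_four_byte_lead(cesu8))          # False
```

A check like this only narrows the gap; overlong forms, unpaired surrogates and truncated sequences would each need their own test.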

Commit 492e5e7: Add support for decoding CESU-8 encoded strings. This works around java's broken utf-8 implementation.

tcalmant merged commit 07ca2a0 into tcalmant:master Jun 29, 2017

nlitsme deleted the itsme-cesu8 branch June 29, 2017 13:53
