@nlitsme
Contributor

@nlitsme nlitsme commented Jun 29, 2017

This works around Java's broken UTF-8 implementation.

You will need the https://github.com/LuminosoInsight/python-ftfy module for the patch to have an effect.

The following code will now output a 😃 (\U0001f603) instead of raising a UnicodeDecodeError or outputting ??????.

from __future__ import division, print_function
from binascii import a2b_hex
import javaobj

b = a2b_hex("ACED0005740006EDA0BDEDB883")
print(javaobj.loads(b))

The problem with the byte sequence ED A0 BD ED B8 83 is that a UTF-8 decoder reads it as the two code points U+D83D and U+DE03. Those are surrogates, so they are invalid as standalone code points, but together they form a valid UTF-16 surrogate pair. You therefore have to decode twice, first as UTF-8 and then as UTF-16, to end up with the Unicode character U+1F603.
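
For illustration, the two-step decode can be sketched with the standard library alone. This is not the code from the patch (which relies on ftfy); it assumes Python 3.4 or later, where the `surrogatepass` error handler is available for both UTF-8 and UTF-16, and the function name `decode_surrogate_bytes` is made up for this example.

```python
def decode_surrogate_bytes(raw):
    """Decode bytes in which astral characters are stored as two
    UTF-8-encoded surrogates (hypothetical helper, Python 3 only)."""
    # Step 1: decode as UTF-8, letting the surrogates through unchanged.
    halves = raw.decode("utf-8", errors="surrogatepass")
    # Step 2: write the surrogates out as UTF-16 code units, then decode
    # them as UTF-16 so that each pair is combined into one character.
    utf16 = halves.encode("utf-16-be", errors="surrogatepass")
    return utf16.decode("utf-16-be")


raw = bytes.fromhex("EDA0BDEDB883")
halves = raw.decode("utf-8", errors="surrogatepass")
print([hex(ord(c)) for c in halves])  # the two surrogate code points
print(decode_surrogate_bytes(raw) == "\U0001f603")
```

The same result follows from the surrogate-pair arithmetic: 0x10000 + ((0xD83D - 0xD800) << 10) + (0xDE03 - 0xDC00) = 0x1F603.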

@tcalmant tcalmant merged commit 07ca2a0 into tcalmant:master Jun 29, 2017
@tcalmant
Owner

Thanks for your contribution!
I'll add a note about the ftfy package to the README.

@nlitsme nlitsme deleted the itsme-cesu8 branch June 29, 2017 13:53
@voetsjoeba
Contributor

Note that the CESU-8/Java-UTF-8 decoder in ftfy.bad_codecs does not enforce correctness, and is documented as being explicitly intended not to do so.

Here's an example of a byte sequence that is invalid CESU-8 and is rejected by Java, but is accepted by ftfy's decoder:

import ftfy.bad_codecs
print(b'\xf0\x90\x80\x80'.decode("java_utf8", errors="strict") == u"\U00010000") # True

So be careful not to rely on the codec to make accept/reject decisions about the validity of serialized objects ...
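
If a reject decision is needed, one option is a separate check before decoding. The sketch below covers only the case from the example above: Java's `DataInput.readUTF` format has no four-byte groups, so a byte of the form 1111xxxx can never appear in it. This is a partial check, not a full validator, and both function names are hypothetical (Python 3 only).

```python
def has_four_byte_lead(data):
    """Return True if data contains a byte of the form 1111xxxx,
    which Java's modified UTF-8 never produces."""
    return any(b >= 0xF0 for b in data)


def reject_four_byte_sequences(data):
    """Raise ValueError for input with such a byte; otherwise return
    the data unchanged so it can be passed on to a lenient decoder."""
    if has_four_byte_lead(data):
        raise ValueError("four-byte UTF-8 sequence is not valid Java UTF-8")
    return data


print(has_four_byte_lead(b"\xf0\x90\x80\x80"))             # True
print(has_four_byte_lead(bytes.fromhex("EDA0BDEDB883")))   # False
```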
