Wrong UTF-8 decoding of Unicode code points higher than U+FFFF #61

@ghost

Description

When the database uses UTF-8 encoding and a stored character has a Unicode code point higher than U+FFFF, the Java ResultSet#getString() method returns an incorrect character; every such character comes back as the same wrong character. The byte representation returned by ResultSet#getBytes() seems to be correct.

I suspect the bug is in the native JNI implementation (NativeDB.c): https://github.com/xerial/sqlite-jdbc/blob/master/src/main/java/org/sqlite/core/NativeDB.c#L503

How to reproduce

  • Linux x64
  • org.xerial:sqlite-jdbc:3.8.11.2
  • java version 1.8.0_45 - Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
# HINTS:
# - LANG=en_US.UTF-8

echo "CREATE TABLE TEST (id INTEGER PRIMARY KEY, name CHARSET);" | sqlite3 /tmp/sqlite.db -batch
echo "PRAGMA encoding = \"UTF-8\";" | sqlite3 /tmp/sqlite.db -batch

# LATIN CAPITAL LETTER A
# http://unicode-table.com/de/0041/
# http://www.fileformat.info/info/unicode/char/0041/index.htm
echo -e "INSERT INTO TEST (name) VALUES ('\x41');" | sqlite3 /tmp/sqlite.db -batch

# Miao Letter Archaic Ma
# http://unicode-table.com/de/16F06/
# http://www.fileformat.info/info/unicode/char/16F06/index.htm
echo -e "INSERT INTO TEST (name) VALUES ('\xF0\x96\xBC\x86');" | sqlite3 /tmp/sqlite.db -batch
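import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class Repro {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("characterEncoding", "UTF-8");
        properties.setProperty("encoding", "\"UTF-8\"");
        try (Connection connection = DriverManager.getConnection("jdbc:sqlite:/tmp/sqlite.db", properties);
             Statement statement = connection.createStatement();
             ResultSet rs = statement.executeQuery("select name from TEST")) {
            while (rs.next()) {
                // Raw column bytes decoded as standard UTF-8: prints the expected character.
                byte[] b = rs.getBytes("name");
                System.out.println("VALUE A = " + new String(b, StandardCharsets.UTF_8));

                // String decoded by the driver: wrong for code points above U+FFFF.
                String value = rs.getString("name");
                System.out.println("VALUE B = " + value);
            }
        }
    }
}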
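
Since a terminal font may have no glyph for U+16F06, comparing VALUE A and VALUE B by eye can be misleading. A small helper that prints the code points of each string makes the difference explicit. This is only a sketch; the class name CodePointDump and the method dump are made up for illustration and are not part of sqlite-jdbc.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Hypothetical helper: renders a string as its Unicode code points, e.g. "U+0041 U+16F06".
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The two characters inserted by the shell commands in this report.
        System.out.println(dump("A"));                                    // U+0041
        System.out.println(dump(new String(Character.toChars(0x16F06)))); // U+16F06
    }
}
```

Calling dump(...) on both VALUE A and VALUE B in the loop should show U+16F06 for the byte-decoded value and some other code points for the getString() value if the bug is present.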

Hints

This may be a mismatch between standard UTF-8 and the Modified UTF-8 (a CESU-8-like encoding) that the JNI string functions expect. According to the JNI specification:

...Characters with code points above U+FFFF (so-called supplementary characters) are represented
 by separately encoding the two surrogate code units of their UTF-16 representation...

https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#modified_utf_8_strings
http://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542
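
The difference between the two encodings can be seen from plain Java, without SQLite. The sketch below encodes U+16F06 once as standard UTF-8 and once as Modified UTF-8, using DataOutputStream.writeUTF, which is documented to write Modified UTF-8 after a two-byte length prefix. The class name ModifiedUtf8Demo and its helper methods are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Standard UTF-8: one 4-byte sequence for a supplementary character.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8: each UTF-16 surrogate is encoded separately as 3 bytes.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // drop the 2-byte length prefix
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String miao = new String(Character.toChars(0x16F06));
        System.out.println("UTF-8:          " + hex(standardUtf8(miao))); // F0 96 BC 86
        System.out.println("Modified UTF-8: " + hex(modifiedUtf8(miao))); // ED A0 9B ED BC 86
    }
}
```

The database holds the four-byte form (the same bytes used in the INSERT above), so a decoder that expects the six-byte surrogate form cannot interpret it correctly.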

Ideas

  • The native JNI code probably needs to check whether the database column value is encoded as CESU-8 or as UTF-8, and convert it to a Java String (UTF-16) object accordingly.
  • Use the SQLite C functions sqlite3_column_text16 and sqlite3_column_bytes16 together with the JNI function NewString to create the Java String object.
  • Use the PRAGMA encoding setting to decide how VARCHAR fields are decoded to a Java String.
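
Until the native code is changed, a caller can sidestep getString() in the same way the reproduction does for VALUE A: fetch the raw bytes and decode them in Java. The following is a minimal sketch, assuming the database encoding is UTF-8 and that getBytes() returns the stored text bytes unchanged, as observed above. Utf8Columns, decode and getUtf8String are hypothetical names, not sqlite-jdbc API.

```java
import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class Utf8Columns {
    private Utf8Columns() {}

    // Decodes raw column bytes as standard UTF-8; null stands for SQL NULL.
    static String decode(byte[] raw) {
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    // Hypothetical replacement for rs.getString(column).
    // Assumption: the database was created with PRAGMA encoding = "UTF-8".
    static String getUtf8String(ResultSet rs, String column) throws SQLException {
        return decode(rs.getBytes(column));
    }
}
```

In the reproduction loop, Utf8Columns.getUtf8String(rs, "name") would take the place of rs.getString("name"). This is a caller-side workaround only and does not address the decoding in NativeDB.c.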

FYI

I already reported this bug at https://bitbucket.org/xerial/sqlite-jdbc/issues/200
