Wrong UTF-8 decoding of Unicode code points higher than U+FFFF

When the database uses UTF-8 encoding and a character whose Unicode code point is higher than U+FFFF is inserted, the Java `ResultSet#getString()` method returns the same incorrect character for each of these characters. The byte representation (as returned by `ResultSet#getBytes()`) seems to be correct.

I suspect the bug is in the native JNI implementation: https://github.com/xerial/sqlite-jdbc/blob/master/src/main/java/org/sqlite/core/NativeDB.c#L503
## How to reproduce
- Linux x64
- org.xerial:sqlite-jdbc:3.8.11.2
- java version 1.8.0_45 - Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

``` bash
# HINTS:
# - LANG=en_US.UTF-8

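# The encoding can only be set before the database is created, so the PRAGMA runs
# in the same session as CREATE TABLE (UTF-8 is also SQLite's default encoding).
echo "PRAGMA encoding = \"UTF-8\"; CREATE TABLE TEST (id INTEGER PRIMARY KEY, name CHARSET);" | sqlite3 /tmp/sqlite.db -batch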

# LATIN CAPITAL LETTER A
# http://unicode-table.com/de/0041/
# http://www.fileformat.info/info/unicode/char/0041/index.htm
echo -e "INSERT INTO TEST (name) VALUES ('\x41');" | sqlite3 /tmp/sqlite.db -batch

# Miao Letter Archaic Ma
# http://unicode-table.com/de/16F06/
# http://www.fileformat.info/info/unicode/char/16F06/index.htm
echo -e "INSERT INTO TEST (name) VALUES ('\xF0\x96\xBC\x86');" | sqlite3 /tmp/sqlite.db -batch
```

``` java
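import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class Utf8Repro {
    public static void main(String[] args) throws SQLException {
        Properties properties = new Properties();
        properties.setProperty("characterEncoding", "UTF-8");
        properties.setProperty("encoding", "\"UTF-8\"");
        try (Connection connection = DriverManager.getConnection("jdbc:sqlite:/tmp/sqlite.db", properties);
             Statement statement = connection.createStatement();
             ResultSet rs = statement.executeQuery("select name from TEST")) {
            while (rs.next()) {
                // VALUE A: decode the raw column bytes as UTF-8 in Java (appears correct).
                byte[] b = rs.getBytes("name");
                System.out.println("VALUE A = " + new String(b, StandardCharsets.UTF_8));

                // VALUE B: let the driver decode the column (wrong above U+FFFF).
                String value = rs.getString("name");
                System.out.println("VALUE B = " + value);
            }
        }
    }
}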
```
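
To compare `VALUE A` and `VALUE B` without depending on how the terminal renders them, it can help to print the code points each string contains. The following is a minimal sketch of such a helper; the class and method names (`CodePoints`, `describe`) are illustrative and not part of sqlite-jdbc.

``` java
import java.util.stream.Collectors;

public class CodePoints {
    // Formats every code point of a string as U+XXXX, separated by spaces.
    // Unpaired surrogates appear as U+D800..U+DFFF, which makes decoding errors visible.
    public static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // MIAO LETTER ARCHAIC MA, the supplementary character used in the reproduction.
        String expected = new String(Character.toChars(0x16F06));
        System.out.println(describe("A"));      // U+0041
        System.out.println(describe(expected)); // U+16F06
    }
}
```

Calling `describe(value)` on the string returned by `rs.getString("name")` shows which code points the driver actually produced.
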
## Hints

Maybe it's a CESU-8 (Modified UTF-8) vs. standard UTF-8 encoding issue. According to the JNI specification:

```
...Characters with code points above U+FFFF (so-called supplementary characters) are represented
 by separately encoding the two surrogate code units of their UTF-16 representation...
```

https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#modified_utf_8_strings
http://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542
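
The difference between the two encodings can be seen from plain Java, without any native code. The sketch below encodes U+16F06 once as standard UTF-8 and once as Modified UTF-8, using `DataOutputStream.writeUTF` as a convenient source of Modified UTF-8 bytes. The class and helper names are illustrative only, and this is not how the driver itself handles strings.

``` java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingComparison {
    // Standard UTF-8: one 4-byte sequence per supplementary character.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF,
    // with the leading 2-byte length prefix stripped.
    public static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Renders bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = new String(Character.toChars(0x16F06));
        System.out.println("UTF-8:          " + hex(standardUtf8(s))); // F0 96 BC 86
        System.out.println("Modified UTF-8: " + hex(modifiedUtf8(s))); // ED A0 9B ED BC 86
    }
}
```

The database stores the 4-byte form, so any native call that expects Modified UTF-8 input would be handed a byte sequence it is not defined for.
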
## Ideas
- The native JNI code probably needs to check whether the database column value is encoded as CESU-8 or as UTF-8, and convert it to a Java String (UTF-16) object accordingly.
- Use the SQLite `sqlite3_column_text16` or `sqlite3_column_bytes16` function together with the JNI `NewString` function to create the Java String object.
- Use the `PRAGMA encoding` setting to decide how to decode VARCHAR fields into Java Strings.
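
The check described in the first idea can be illustrated at the byte level. The following is only a heuristic sketch in Java, assuming well-formed input; the names `Utf8Flavor` and `classify` are hypothetical, and an actual fix would most likely live in the native code rather than in Java.

``` java
public class Utf8Flavor {
    public enum Flavor { NO_SUPPLEMENTARY, STANDARD_UTF8, SURROGATE_PAIRS }

    // Heuristic: standard UTF-8 encodes supplementary characters with a lead byte
    // in F0..F7, while CESU-8 / Modified UTF-8 encode each surrogate as a 3-byte
    // sequence starting with ED followed by a byte in A0..BF.
    public static Flavor classify(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if ((b & 0xF8) == 0xF0) {
                return Flavor.STANDARD_UTF8;
            }
            if (b == 0xED && i + 1 < bytes.length) {
                int next = bytes[i + 1] & 0xFF;
                if (next >= 0xA0 && next <= 0xBF) {
                    return Flavor.SURROGATE_PAIRS;
                }
            }
        }
        // Only ASCII / BMP characters: both encodings agree (apart from U+0000).
        return Flavor.NO_SUPPLEMENTARY;
    }

    public static void main(String[] args) {
        byte[] stored = {(byte) 0xF0, (byte) 0x96, (byte) 0xBC, (byte) 0x86};
        System.out.println(classify(stored)); // STANDARD_UTF8
    }
}
```
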
## FYI

I already reported this bug at https://bitbucket.org/xerial/sqlite-jdbc/issues/200

