Truncate transform on strings with Unicode characters #293

@electrum

The specification for truncate says

Substring of length L

but does not define the unit in which L is counted. I assume the intent is Unicode code points, since the specification also says that

Character strings must be stored as UTF-8 encoded byte arrays

However, the Java reference implementation uses java.lang.CharSequence#subSequence, so the length is counted in 16-bit UTF-16 code units (Java chars). That count differs from the code point count for characters outside the Basic Multilingual Plane (BMP): each such code point requires two code units, encoded as a high and low surrogate pair. Additionally, the truncation may fall in the middle of a surrogate pair, leaving an unpaired surrogate, which is a form of corruption.
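
To make the difference concrete, here is a minimal sketch comparing the two ways of counting. The class and method names (TruncateSketch, truncateByCodeUnits, truncateByCodePoints) are hypothetical and are not taken from the reference implementation. The first method only mirrors the subSequence behavior described above, and the second is one possible code-point-based alternative, not necessarily the fix the project would choose.

```java
public final class TruncateSketch {

    // Mirrors the behavior described above: the length is counted in
    // UTF-16 code units, so a surrogate pair can be split in half.
    static String truncateByCodeUnits(String s, int length) {
        return s.subSequence(0, Math.min(length, s.length())).toString();
    }

    // Hypothetical alternative: the length is counted in Unicode code points,
    // so a surrogate pair is either kept whole or dropped whole.
    static String truncateByCodePoints(String s, int length) {
        int codePoints = s.codePointCount(0, s.length());
        int end = s.offsetByCodePoints(0, Math.min(length, codePoints));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (outside the BMP, written as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";

        String byUnits = truncateByCodeUnits(s, 2);
        String byPoints = truncateByCodePoints(s, 2);

        // The code-unit result ends with an unpaired high surrogate.
        System.out.println(Character.isHighSurrogate(byUnits.charAt(byUnits.length() - 1))); // true
        // The code-point result keeps both halves of the pair: 'a' plus two code units.
        System.out.println(byPoints.length()); // 3
    }
}
```

With L = 2, the code-unit version returns 'a' followed by a lone high surrogate, which cannot be encoded as valid UTF-8, while the code-point version returns 'a' followed by the complete character.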
