Truncate transform on strings with Unicode characters #293

The specification for truncate says

> *Substring of length `L`*

but does not define the unit in which `L` is counted. I assume the intended unit is the Unicode code point, since the specification says that

> Character strings must be stored as UTF-8 encoded byte arrays

However, the Java reference implementation uses `java.lang.CharSequence#subSequence`, so the length is counted in 16-bit UTF-16 code units. That count differs from the code point count for characters outside the Basic Multilingual Plane (BMP): each such code point requires two Java `char` values, encoded as a high and low surrogate pair. Additionally, the truncation may fall in the middle of a surrogate pair, which corrupts the string.
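A minimal sketch of the difference, using only standard `java.lang` calls. The class name `TruncateDemo` and both helper methods are hypothetical and written for illustration; they are not taken from the reference implementation, and `truncateByCodePoints` is only one possible way to count by code points.

```java
public class TruncateDemo {
    // Truncates by UTF-16 code units, which is how CharSequence#subSequence counts.
    static CharSequence truncateByCodeUnits(CharSequence s, int length) {
        return s.subSequence(0, Math.min(length, s.length()));
    }

    // Hypothetical alternative: truncates by Unicode code points,
    // so a surrogate pair is never split.
    static String truncateByCodePoints(String s, int length) {
        int codePoints = s.codePointCount(0, s.length());
        if (length >= codePoints) {
            return s;
        }
        // Index of the code unit that follows the first `length` code points.
        int end = s.offsetByCodePoints(0, length);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (outside the BMP, written as a surrogate pair), then 'b'.
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // Truncating to 2 code units keeps 'a' plus a lone high surrogate.
        CharSequence split = truncateByCodeUnits(s, 2);
        System.out.println(Character.isHighSurrogate(split.charAt(1))); // true

        // Truncating to 2 code points keeps 'a' plus the whole pair.
        String whole = truncateByCodePoints(s, 2);
        System.out.println(whole.length());                  // 3 code units
    }
}
```

With an input made up only of BMP characters the two helpers return the same text, so the discrepancy shows up only when supplementary characters are present.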
