
fix: validate low surrogate before combining surrogate pair in decode2utf8 #393

Open
xingcici wants to merge 1 commit into apache:master from xingcici:fix/high_surrogate


@xingcici xingcici commented Apr 1, 2026

Problem

When a Java client sends a string containing a lone high surrogate (e.g. \ud83d), decode2utf8 unconditionally combines the following 3 bytes as a low surrogate, even when those bytes do not encode a value in the low-surrogate range [0xDC00, 0xDFFF], corrupting the characters that follow.

Reproducer: "test_go_hessian2\ud83d...". Here \ud83d is a lone high surrogate followed by ... (0x2E 0x2E 0x2E, c2 = 0x002E), which is outside the low-surrogate range.

Fix

Add a guard if c2 >= 0xDC00 && c2 <= 0xDFFF in decode2utf8 after reading the candidate low-surrogate bytes. The surrogate pair is only combined when a valid low surrogate follows; otherwise the high surrogate is left as an independent 3-byte sequence and decoding continues normally.
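A minimal sketch of the guarded decode path, for illustration only. It is not the library's decode2utf8: the names cesu8ToUTF8 and decode3 are hypothetical, the input is an in-memory slice rather than a bufio.Reader, and the sketch additionally checks the lead and continuation byte patterns, which is an assumption beyond the range check described above.

```go
package main

import "fmt"

// decode3 extracts the 16-bit value carried by a 3-byte sequence of the
// form 1110xxxx 10xxxxxx 10xxxxxx. ok is false if b does not start with one.
// (Hypothetical helper, not part of the library.)
func decode3(b []byte) (v uint32, ok bool) {
	if len(b) < 3 || b[0]&0xf0 != 0xe0 || b[1]&0xc0 != 0x80 || b[2]&0xc0 != 0x80 {
		return 0, false
	}
	return uint32(b[0]&0x0f)<<12 | uint32(b[1]&0x3f)<<6 | uint32(b[2]&0x3f), true
}

// cesu8ToUTF8 combines a surrogate pair only when a valid low surrogate
// follows the high surrogate; everything else is copied through unchanged.
func cesu8ToUTF8(in []byte) []byte {
	out := make([]byte, 0, len(in))
	for i := 0; i < len(in); {
		c1, ok := decode3(in[i:])
		if ok && c1 >= 0xD800 && c1 <= 0xDBFF {
			c2, ok2 := decode3(in[i+3:])
			if ok2 && c2 >= 0xDC00 && c2 <= 0xDFFF { // the guard
				r := (c1-0xD800)<<10 + (c2 - 0xDC00) + 0x10000
				out = append(out, string(rune(r))...)
				i += 6
				continue
			}
			// No valid low surrogate: fall through and keep the 3 bytes as-is.
		}
		if ok {
			out = append(out, in[i:i+3]...)
			i += 3
			continue
		}
		out = append(out, in[i])
		i++
	}
	return out
}

func main() {
	// 'A', lone high surrogate U+D83D (ED A0 BD), then "..."
	lone := []byte{'A', 0xED, 0xA0, 0xBD, '.', '.', '.'}
	fmt.Printf("%q\n", cesu8ToUTF8(lone))

	// CESU-8 for U+1F600: D83D (ED A0 BD) followed by DE00 (ED B8 80)
	pair := []byte{0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80}
	fmt.Printf("%q\n", cesu8ToUTF8(pair))
}
```

In the lone-surrogate case the three dots survive and the surrogate bytes are passed through untouched; in the paired case the six bytes collapse into one 4-byte UTF-8 sequence.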

Test

Added TestDecodeStringLoneHighSurrogateRealWorld using a byte stream captured directly from Java hessian2 via Hessian2Output.writeString.

🤖 Generated with Claude Code

…2utf8

Before this fix, when decode2utf8 encountered a high surrogate (0xD800–0xDBFF),
it unconditionally combined the following 3 bytes as a low surrogate, even when
those bytes did not encode a value in [0xDC00, 0xDFFF]. This corrupted the
characters following a lone high surrogate sent by Java clients.

The fix adds a guard `if c2 >= 0xDC00 && c2 <= 0xDFFF` before the surrogate-pair
combine, and falls through to treat the high surrogate as an independent 3-byte
sequence when no valid low surrogate follows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tiltwind (Contributor) commented Apr 5, 2026

Opus 4.6 analysis of PR #393

Background

The decode2utf8 function decodes the CESU-8 byte stream produced by Java hessian2 into Go UTF-8. Java's hessian2 uses CESU-8 encoding, which transmits characters outside the BMP (such as emoji) as a surrogate pair:

  • High surrogate: 0xD800–0xDBFF
  • Low surrogate: 0xDC00–0xDFFF
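To make the two ranges concrete, here is a small sketch that splits one supplementary code point into its surrogate pair and writes each half as a 3-byte sequence, as CESU-8 does. The helpers encode3 and cesu8Pair are hypothetical names used only for this illustration; utf16.EncodeRune is from the Go standard library.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encode3 writes a 16-bit code unit as the 3-byte form
// 1110xxxx 10xxxxxx 10xxxxxx. (Hypothetical helper.)
func encode3(u uint16) []byte {
	return []byte{
		0xE0 | byte(u>>12),
		0x80 | (byte(u>>6) & 0x3f),
		0x80 | (byte(u) & 0x3f),
	}
}

// cesu8Pair returns the 6-byte CESU-8 form of a code point above U+FFFF.
func cesu8Pair(r rune) []byte {
	hi, lo := utf16.EncodeRune(r)
	return append(encode3(uint16(hi)), encode3(uint16(lo))...)
}

func main() {
	hi, lo := utf16.EncodeRune(0x1F600)
	fmt.Printf("high=U+%04X low=U+%04X\n", hi, lo)
	fmt.Printf("% X\n", cesu8Pair(0x1F600))
}
```

For U+1F600 the high half is U+D83D, the same value that appears as \ud83d in the reproducer, and the low half is U+DE00.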

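Cause of the bug

Original code (string.go:377-395): once a high surrogate c1 ∈ [0xD800, 0xDBFF] is detected, the code unconditionally reads the next 3 bytes and combines them as if they were a low surrogate:

c2 := ((uint32(data[start+3]) & 0x0f) << 12) + ...
c := (c1-0xD800)<<10 + (c2 - 0xDC00) + 0x10000 // c2 is never checked to be a low surrogate!

If the Java side sends a lone high surrogate (one that is not followed by a low surrogate), for example \ud83d followed by ... (0x2E 0x2E 0x2E), then:

  1. c2 is computed as 0x002E, which is nowhere near the range [0xDC00, 0xDFFF]
  2. The code still evaluates (c2 - 0xDC00), which underflows and yields a wrong value
  3. The result is an incorrect Unicode code point that swallows the following 3 bytes, corrupting the string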
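The arithmetic in these steps can be checked in isolation. The sketch below takes c1 and c2 as plain parameters, so it makes no claim about how the library derives c2 from the bytes; uncheckedCombine and isLowSurrogate are hypothetical names, and the value 0x002E is the one quoted in the steps above.

```go
package main

import "fmt"

// uncheckedCombine evaluates the combine expression quoted above
// without any range check on c2.
func uncheckedCombine(c1, c2 uint32) uint32 {
	return (c1-0xD800)<<10 + (c2 - 0xDC00) + 0x10000
}

// isLowSurrogate is the range check that the fix adds.
func isLowSurrogate(c2 uint32) bool {
	return c2 >= 0xDC00 && c2 <= 0xDFFF
}

func main() {
	for _, c2 := range []uint32{0xDE00, 0x002E} {
		fmt.Printf("c2=0x%04X low=%v combined=U+%04X\n",
			c2, isLowSurrogate(c2), uncheckedCombine(0xD83D, c2))
	}
}
```

With uint32 operands the intermediate (c2 - 0xDC00) wraps around, and the later + 0x10000 wraps it back, so the final value looks like an ordinary code point rather than an obviously broken one. That is why the corruption is silent.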

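Fix

A conditional check is added before the surrogate pair is combined:

if c2 >= 0xDC00 && c2 <= 0xDFFF {
// valid low surrogate: combine as usual
...
continue
}
// not a low surrogate: skip the combine and take the normal 3-byte path (start += 3)

When c2 is not a valid low surrogate, nothing is combined: the high surrogate is kept as an independent 3-byte CESU-8 sequence and the bytes that follow are processed normally.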

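Test case

The test builds the byte stream that Java hessian2 produces for "test_go_hessian2\ud83d..." and verifies that, after decoding:

  • The high surrogate \ud83d is kept as its original 3 bytes \xed\xa0\xbd
  • The trailing ... is not swallowed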
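Assessment

Strengths

  1. The bug is real: when a lone high surrogate is followed by a non-surrogate character, the original code does corrupt the data, so this is a correct bug fix
  2. The change is minimal: it adds a single if guard and leaves the normal surrogate-pair path untouched
  3. It comes with a test case that uses a real Java-encoded byte stream as input

Potential issues

  1. The 3 pre-read bytes are not consumed by this iteration: when c2 turns out not to be a low surrogate, the code takes the start += 3 path and consumes only the 3 bytes of the high surrogate. It has already pre-read the next 3 bytes (io.ReadFull(r, data[end:start+6])), but those bytes are in the buffer (end has been advanced to start+6), so the next loop iteration processes them normally. This is fine; no buffer data is lost.
  2. charCount: the fall-through path reaches start += 3; charCount++, which counts the lone high surrogate as 1 character. This matches how writeString computes the length on the Java side (Java counts each surrogate as 1 char), so the count is correct.
  3. Invalid UTF-8 is kept in the output: the lone high surrogate \xed\xa0\xbd is technically illegal in UTF-8 (RFC 3629 forbids encoding surrogate code points). However:
    - Go's string type does not enforce UTF-8 validity
    - The data comes from Java, and preserving the original bytes is the safest choice
    - Replacing it with U+FFFD or handling it some other way could make the data irreversible

So this behavior is reasonable.
  4. Lone low surrogates are not handled: the code does not check for c1 ∈ [0xDC00, 0xDFFF] (a lone low surrogate). In the current code this case takes the ordinary 3-byte path (start += 3) and does not corrupt data. It is existing behavior and not something this PR needs to address.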
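The trade-off in point 3 can be seen with the standard library alone. This is a sketch, not code from the PR: isValidUTF8 and replaceInvalid are hypothetical wrappers around utf8.ValidString and strings.ToValidUTF8.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// isValidUTF8 reports whether s is well-formed UTF-8.
func isValidUTF8(s string) bool {
	return utf8.ValidString(s)
}

// replaceInvalid substitutes U+FFFD for each run of invalid bytes.
// The original bytes cannot be recovered from the result.
func replaceInvalid(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// The 3 bytes of the lone high surrogate U+D83D followed by "..."
	s := "\xed\xa0\xbd..."
	fmt.Println(len(s), isValidUTF8(s))
	fmt.Printf("%q\n", replaceInvalid(s))
}
```

A Go string holds the surrogate bytes without complaint, but callers that require valid UTF-8 (for example, before handing the value to a strict encoder) would need to check or sanitize it themselves.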

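Conclusion

Recommend approving. This is a clear, minimal bug fix that correctly resolves the data corruption caused by a lone high surrogate and introduces no new regression risk. The logic is correct and the test covers the core scenario.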

tiltwind (Contributor) commented Apr 5, 2026

@xingcici When would a Java client send a string containing a lone high surrogate? Can you provide an example?

xingcici (Author) commented Apr 8, 2026

> @xingcici When will a Java client send a string containing a lone high surrogate? can u provide an example?

The scenario I encountered is: the user truncated their string by length, and the truncation point happened to fall on an emoji — specifically cutting right at the high surrogate, leaving only the high surrogate in the string.
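A sketch of that scenario, simulated in Go with UTF-16 code units since a Java String is indexed by UTF-16 code unit. The helpers truncateUnits and endsWithLoneHigh are hypothetical names for this illustration; the sample string is made up.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// truncateUnits cuts s after n UTF-16 code units, the way a length-based
// cut on a Java String would.
func truncateUnits(s string, n int) []uint16 {
	u := utf16.Encode([]rune(s))
	if n > len(u) {
		n = len(u)
	}
	return u[:n]
}

// endsWithLoneHigh reports whether the last code unit is a high surrogate,
// which means its low surrogate was cut off.
func endsWithLoneHigh(u []uint16) bool {
	return len(u) > 0 && u[len(u)-1] >= 0xD800 && u[len(u)-1] <= 0xDBFF
}

func main() {
	s := "hi\U0001F600" // 'h', 'i', then an emoji stored as D83D DE00
	for n := 2; n <= 4; n++ {
		u := truncateUnits(s, n)
		fmt.Printf("n=%d units=%04X loneHigh=%v\n", n, u, endsWithLoneHigh(u))
	}
}
```

Cutting at 3 code units keeps the high half of the emoji and drops the low half, which produces exactly the kind of string described in the reproducer.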
