fix: validate low surrogate before combining surrogate pair in decode2utf8 #393
xingcici wants to merge 1 commit into apache:master
…2utf8

Before this fix, when decode2utf8 encountered a high surrogate (0xD800–0xDBFF), it unconditionally combined the following 3 bytes as a low surrogate, even when those bytes did not encode a value in [0xDC00, 0xDFFF]. This corrupted the characters following a lone high surrogate sent by Java clients.

The fix adds a guard `if c2 >= 0xDC00 && c2 <= 0xDFFF` before the surrogate-pair combine, and falls through to treat the high surrogate as an independent 3-byte sequence when no valid low surrogate follows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
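As a rough illustration of the guard described in the commit message, here is a minimal sketch in Go. It is not the code from the PR: `cesu8ToUTF8` and `decode3` are hypothetical names, the function works on a plain byte slice rather than a reader, and the lead-byte check on the second sequence is an extra precaution of this sketch that the commit message does not mention.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode3 decodes one 3-byte sequence (1110xxxx 10xxxxxx 10xxxxxx)
// into a 16-bit value. Hypothetical helper, not taken from the PR.
func decode3(b []byte) uint32 {
	return (uint32(b[0])&0x0f)<<12 | (uint32(b[1])&0x3f)<<6 | (uint32(b[2]) & 0x3f)
}

// cesu8ToUTF8 is a simplified, hypothetical stand-in for decode2utf8.
// It combines a surrogate pair only when a valid low surrogate follows.
func cesu8ToUTF8(data []byte) []byte {
	out := make([]byte, 0, len(data))
	for i := 0; i < len(data); {
		// Bytes that do not start a complete 3-byte sequence are copied unchanged.
		if data[i]&0xf0 != 0xe0 || i+3 > len(data) {
			out = append(out, data[i])
			i++
			continue
		}
		c1 := decode3(data[i:])
		isHigh := c1 >= 0xD800 && c1 <= 0xDBFF
		// Assumption of this sketch: also require a 3-byte lead byte for the candidate.
		if isHigh && i+6 <= len(data) && data[i+3]&0xf0 == 0xe0 {
			c2 := decode3(data[i+3:])
			if c2 >= 0xDC00 && c2 <= 0xDFFF { // the guard described above
				r := rune((c1-0xD800)<<10 + (c2 - 0xDC00) + 0x10000)
				out = utf8.AppendRune(out, r)
				i += 6
				continue
			}
		}
		// Lone high surrogate or ordinary BMP character: keep the 3 bytes as they are.
		out = append(out, data[i:i+3]...)
		i += 3
	}
	return out
}

func main() {
	lone := []byte{0xED, 0xA0, 0xBD, '.', '.', '.'}      // \ud83d followed by "..."
	pair := []byte{0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80} // \ud83d\ude00
	fmt.Printf("% X\n", cesu8ToUTF8(lone))
	fmt.Printf("% X\n", cesu8ToUTF8(pair))
}
```

In this sketch the three `.` bytes after the lone high surrogate come out untouched, while a well-formed pair is still merged into a single 4-byte UTF-8 sequence.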
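opus 4.6 analysis of PR #393
Background
The decode2utf8 function decodes the CESU-8 byte stream produced by Java hessian2 into Go's UTF-8. Java's hessian2 uses CESU-8 encoding, which splits characters outside the BMP (such as emoji) into a surrogate pair for transmission: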
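To make the CESU-8 splitting concrete, the following sketch encodes one supplementary character both ways. The helper names (`encodeUnit3`, `cesu8Supplementary`) and the choice of U+1F600 are assumptions made for illustration; they do not come from the PR.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encodeUnit3 writes one 16-bit code unit in the 3-byte form
// 1110xxxx 10xxxxxx 10xxxxxx. Hypothetical helper for illustration.
func encodeUnit3(u uint16) []byte {
	return []byte{
		0xE0 | byte(u>>12),
		0x80 | (byte(u>>6) & 0x3F),
		0x80 | (byte(u) & 0x3F),
	}
}

// cesu8Supplementary encodes a code point above U+FFFF as CESU-8:
// split it into a UTF-16 surrogate pair, then encode each half separately.
func cesu8Supplementary(r rune) []byte {
	hi, lo := utf16.EncodeRune(r)
	return append(encodeUnit3(uint16(hi)), encodeUnit3(uint16(lo))...)
}

func main() {
	r := rune(0x1F600)
	fmt.Printf("CESU-8: % X\n", cesu8Supplementary(r)) // ED A0 BD ED B8 80
	fmt.Printf("UTF-8:  % X\n", []byte(string(r)))     // F0 9F 98 80
}
```

The first three CESU-8 bytes (ED A0 BD) are the high surrogate 0xD83D, and the last three (ED B8 80) are the low surrogate 0xDE00.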
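Cause of the bug
Original code (string.go:377-395): once a high surrogate c1 ∈ [0xD800, 0xDBFF] is detected, the code unconditionally reads the next 3 bytes and combines them as if they were a low surrogate: `c2 := ((uint32(data[start+3]) & 0x0f) << 12) + ...` If the Java side sends a lone high surrogate (one that is not followed by a low surrogate), for example \ud83d followed by ... (that is, 0x2E 0x2E 0x2E), then: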
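The arithmetic behind the bug can be worked through in a few lines. This is a sketch under assumptions: `decode3` completes the elided expression with the usual 3-byte layout, and `combineUnchecked` uses the standard UTF-16 pair formula; neither is copied from string.go.

```go
package main

import "fmt"

// decode3 decodes a 3-byte sequence with the usual masks.
// Assumed completion of the elided expression; not copied from string.go.
func decode3(b []byte) uint32 {
	return (uint32(b[0])&0x0f)<<12 + (uint32(b[1])&0x3f)<<6 + (uint32(b[2]) & 0x3f)
}

// isLowSurrogate is the range check that the fix introduces.
func isLowSurrogate(c uint32) bool {
	return c >= 0xDC00 && c <= 0xDFFF
}

// combineUnchecked mirrors the pre-fix behaviour described above:
// it merges c1 and c2 without checking that c2 is a low surrogate.
func combineUnchecked(c1, c2 uint32) uint32 {
	return (c1-0xD800)<<10 + (c2 - 0xDC00) + 0x10000
}

func main() {
	data := []byte{0xED, 0xA0, 0xBD, '.', '.', '.'} // \ud83d followed by "..."
	c1 := decode3(data[0:3])
	c2 := decode3(data[3:6])
	fmt.Printf("c1 = %#X\n", c1)
	fmt.Printf("c2 = %#X, valid low surrogate: %v\n", c2, isLowSurrogate(c2))
	fmt.Printf("unchecked combine yields U+%X\n", combineUnchecked(c1, c2))
}
```

Under these assumptions the three `.` bytes decode to 0xEBAE, which is not a low surrogate, and the unchecked combine turns four characters' worth of input into the single unrelated code point U+203AE.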
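Fix
A condition is added before the surrogate pair is combined: `if c2 >= 0xDC00 && c2 <= 0xDFFF {` When c2 is not a valid low surrogate, nothing is combined; the high surrogate is kept as an independent 3-byte CESU-8 sequence and the bytes that follow are processed normally.
Test case
The test constructs the byte stream that Java hessian2 produces for "test_go_hessian2\ud83d..." and verifies that, after decoding: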
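Assessment
Strengths
Potential issues
So this behavior is reasonable.
Conclusion
Recommend approving. This is a clear, minimal bug fix that correctly resolves the data corruption caused by a lone high surrogate and introduces no new regression risk. The logic is correct and the test covers the core scenario.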
@xingcici When will a Java client send a string containing a lone high surrogate? Can you provide an example?
The scenario I encountered: the user truncated a string by length, and the truncation point happened to fall inside an emoji, cutting right after the high surrogate and leaving only the high surrogate in the string.
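The truncation scenario can be simulated in Go by working on UTF-16 code units, which is how a Java String counts length. This is only a sketch: the helper names are hypothetical, and U+1F600 is an assumed emoji chosen because its high surrogate is 0xD83D.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// sampleUnits returns the UTF-16 code units of an example string ending in
// an emoji. The emoji (U+1F600) is an assumption made for this illustration.
func sampleUnits() []uint16 {
	return utf16.Encode([]rune("test_go_hessian2\U0001F600"))
}

// truncateUnits cuts by code-unit count, the way a length-based
// substring on a Java String would.
func truncateUnits(units []uint16, n int) []uint16 {
	if n > len(units) {
		n = len(units)
	}
	return units[:n]
}

// endsWithLoneHigh reports whether the last unit is a high surrogate,
// meaning its low surrogate was cut off.
func endsWithLoneHigh(units []uint16) bool {
	if len(units) == 0 {
		return false
	}
	last := units[len(units)-1]
	return last >= 0xD800 && last <= 0xDBFF
}

func main() {
	units := sampleUnits()         // 16 ASCII units + 2 surrogate units
	cut := truncateUnits(units, 17) // cut between the two halves of the emoji
	fmt.Printf("last unit: %#X, lone high surrogate: %v\n",
		cut[len(cut)-1], endsWithLoneHigh(cut))
}
```

A truncation helper on the sending side could use a check like `endsWithLoneHigh` to drop the dangling unit, but the decoder still needs to tolerate such input.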
Problem
When a Java client sends a string containing a lone high surrogate (e.g. `\ud83d`), `decode2utf8` unconditionally combines the following 3 bytes as a low surrogate, even when those bytes do not encode a value in the low-surrogate range [0xDC00, 0xDFFF], corrupting the characters that follow.

Reproducer: `"test_go_hessian2\ud83d..."`. Here `\ud83d` is a lone high surrogate followed by `...` (0x2E 0x2E 0x2E); the candidate c2 read from those bytes falls outside the low-surrogate range.

Fix
Add a guard `if c2 >= 0xDC00 && c2 <= 0xDFFF` in `decode2utf8` after reading the candidate low-surrogate bytes. The surrogate pair is only combined when a valid low surrogate follows; otherwise the high surrogate is left as an independent 3-byte sequence and decoding continues normally.

Test
Added `TestDecodeStringLoneHighSurrogateRealWorld` using a byte stream captured directly from Java hessian2 via `Hessian2Output.writeString`.

🤖 Generated with Claude Code