bug: sanitizeSurrogates() misses JSON-escaped lone surrogates (\uD800 ASCII escape) causing Anthropic API 400 errors #14630

@codeg-dev

Bug Description

OpenCode's sanitizeSurrogates() in packages/opencode/src/util/sanitize-surrogates.ts (added in the custom fetch wrapper in provider.ts) fails to catch lone surrogates that have already been serialized by JSON.stringify() as ASCII escape sequences.

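Root cause: JSON.stringify() encodes each lone surrogate code unit as a six-character ASCII escape sequence (e.g., \ud82c; the hex digits are emitted in lowercase). At this point the string contains only ASCII characters, so String.prototype.isWellFormed() returns true: there are no literal surrogate code units left to detect. The escape still denotes an unpaired surrogate, though. RFC 8259's grammar permits such escapes, but section 8.2 warns that software receiving them behaves unpredictably, and strict parsers reject them. Anthropic's API backend (yyjson/serde_json) rejects these with: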

"no low surrogate in string"

This causes the entire session to hang with a 400 error from the Anthropic API.
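
The problem is easy to miss from the JavaScript side, because JSON.parse accepts the escaped form and quietly hands back the lone surrogate. The sketch below shows that the serialized body is pure ASCII yet still round-trips to an unpaired code unit. The helper and variable names are illustrative and are not taken from the opencode source.

```typescript
// Illustrative helper (not part of opencode): true if every code unit is 7-bit ASCII.
function isAscii(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0x7f) return false
  }
  return true
}

const serializedBody = JSON.stringify({ text: "\uD82C" })

// The lone surrogate is now the six ASCII characters \ u d 8 2 c.
const bodyIsAscii = isAscii(serializedBody)

// JavaScript's JSON.parse is lenient: it decodes the escape back into a lone surrogate.
const roundTripped = (JSON.parse(serializedBody) as { text: string }).text
const stillLone = roundTripped.length === 1 && roundTripped.charCodeAt(0) === 0xd82c

console.log(bodyIsAscii, stillLone)
```

A stricter parser on the receiving end has no obligation to be this forgiving, which is why the request fails only after it leaves the client.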

Reproduction

import { sanitizeSurrogates } from "./packages/opencode/src/util/sanitize-surrogates"

// JSON.stringify encodes the lone surrogate as the ASCII escape \ud82c (lowercase hex)
const serialized = JSON.stringify({ text: "\uD82C" })
// serialized = '{"text":"\\ud82c"}'

// isWellFormed() returns true — no literal surrogates!
console.log(serialized.isWellFormed()) // true

// sanitizeSurrogates passes it through unchanged
const result = sanitizeSurrogates(serialized)
// result still contains the \ud82c escape, which the Anthropic API rejects

Impact

Any tool that produces output containing lone Unicode surrogates (binary file reads, terminal output with non-UTF-8 data, certain East Asian characters in some encodings) will cause the session to fail with a cryptic 400 error.
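
For illustration, one well-known way a lone surrogate appears in a JavaScript string is truncation by UTF-16 code unit. This is an assumed example of the general mechanism, not a confirmed cause of the failures reported here, and the truncation step is hypothetical.

```typescript
// "\uD83D\uDE00" is U+1F600, stored as two UTF-16 code units.
const emoji = "\uD83D\uDE00"

// Hypothetical truncation step: cutting by code unit keeps only the high surrogate.
const truncated = ("ok " + emoji).slice(0, 4)

const lastUnit = truncated.charCodeAt(truncated.length - 1)
const endsWithLoneHigh = lastUnit >= 0xd800 && lastUnit <= 0xdbff

// Once serialized, the lone surrogate becomes an ASCII escape.
const payload = JSON.stringify(truncated)

console.log(endsWithLoneHigh, payload)
```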

Reported in: anthropics/claude-code#1709, anthropics/claude-code#1832

Proposed Fix

Add a JSON-lexer pass before the isWellFormed() check that finds \uDxxx escape sequences and replaces lone ones with \uFFFD:

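// Parses the four hex digits at s[idx..idx+3]; returns -1 if any is not a hex digit (or lies past the end).
function hex4ToInt(s: string, idx: number): number {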
  let x = 0
  for (let i = 0; i < 4; i++) {
    const c = s.charCodeAt(idx + i)
    let v = -1
    if (c >= 0x30 && c <= 0x39) v = c - 0x30
    else if (c >= 0x41 && c <= 0x46) v = c - 0x41 + 10
    else if (c >= 0x61 && c <= 0x66) v = c - 0x61 + 10
    else return -1
    x = (x << 4) | v
  }
  return x
}

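// Scans JSON text and rewrites every unpaired \uD800-\uDFFF escape inside a string literal to \uFFFD.
function fixJsonSurrogateEscapes(jsonText: string): string {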
  // Fast-path: no surrogate escape sequences
  if (jsonText.indexOf("\\uD") === -1 && jsonText.indexOf("\\ud") === -1) {
    return jsonText
  }

  let inString = false
  let out: string[] | null = null
  let last = 0

  for (let i = 0; i < jsonText.length; i++) {
    const ch = jsonText.charCodeAt(i)
    if (!inString) {
      if (ch === 0x22 /* " */) inString = true
      continue
    }
    if (ch === 0x22) { inString = false; continue }
    if (ch !== 0x5c /* \ */) continue
    const next = jsonText.charCodeAt(i + 1)
    if (next !== 0x75 /* u */) { i += 1; continue }
    if (i + 5 >= jsonText.length) { i += 1; continue }

    const code = hex4ToInt(jsonText, i + 2)
    if (code < 0) { i += 1; continue }

    const isHigh = code >= 0xd800 && code <= 0xdbff
    const isLow  = code >= 0xdc00 && code <= 0xdfff
    if (!isHigh && !isLow) { i += 5; continue }

    if (isHigh) {
      const j = i + 6
      if (j + 5 < jsonText.length &&
          jsonText.charCodeAt(j) === 0x5c &&
          jsonText.charCodeAt(j + 1) === 0x75) {
        const code2 = hex4ToInt(jsonText, j + 2)
        if (code2 >= 0xdc00 && code2 <= 0xdfff) { i += 11; continue }  // valid pair
      }
    }

    // Lone surrogate — replace with \uFFFD
    if (!out) out = []
    out.push(jsonText.slice(last, i))
    out.push("\\uFFFD")
    last = i + 6
    i += 5
  }

  if (!out) return jsonText
  out.push(jsonText.slice(last))
  return out.join("")
}

export function sanitizeSurrogates(s: string): string {
  if (typeof s !== "string" || s.length === 0) return s
  s = fixJsonSurrogateEscapes(s)   // ← add this
  if (typeof s.isWellFormed === "function" && s.isWellFormed()) return s
  if (typeof s.toWellFormed === "function") return s.toWellFormed()
  return s.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, "\uFFFD")
}

A PR with tests will follow.
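
As a rough guide to the behavior such tests might pin down, here is a minimal sketch of the expected input/output pairs. To keep it self-contained it uses a compact regex stand-in with the hypothetical name replaceLoneSurrogateEscapes, not the lexer above. It assumes the input is already valid JSON text, so backslashes only occur inside string literals.

```typescript
// Hypothetical regex stand-in for fixJsonSurrogateEscapes, for illustration only.
// Alternatives, tried in order at each position:
//   1. an escaped backslash (consumed so the text after it is not misread as an escape)
//   2. a high-surrogate escape immediately followed by a low-surrogate escape (valid pair)
//   3. any other surrogate escape (lone)
const SURROGATE_ESCAPES =
  /\\\\|\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}|\\u[dD][89a-fA-F][0-9a-fA-F]{2}/g

function replaceLoneSurrogateEscapes(jsonText: string): string {
  // Only a bare six-character surrogate escape is rewritten; other matches pass through.
  return jsonText.replace(SURROGATE_ESCAPES, (m: string) => (m.length === 6 ? "\\uFFFD" : m))
}

// [input JSON text, expected output]
const cases: Array<[string, string]> = [
  ['{"text":"\\ud82c"}', '{"text":"\\uFFFD"}'], // lone high surrogate
  ['{"text":"\\ude00"}', '{"text":"\\uFFFD"}'], // lone low surrogate
  ['{"text":"\\ud83d\\ude00"}', '{"text":"\\ud83d\\ude00"}'], // valid pair is untouched
  ['{"text":"\\\\ud82c"}', '{"text":"\\\\ud82c"}'], // escaped backslash, then literal text
  ['{"text":"\\ud83d\\ud83d\\ude00"}', '{"text":"\\uFFFD\\ud83d\\ude00"}'], // lone high, then a pair
]

for (const [input, expected] of cases) {
  console.log(replaceLoneSurrogateEscapes(input) === expected)
}
```

The escaped-backslash case is the one that a naive search for \uD would get wrong, and it is the main reason the proposed fix walks the text as a lexer instead of pattern-matching on the escape alone.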


Note: Reported and fix developed with AI assistance (Claude Code). The bug was encountered in production when a Korean character was encoded as a lone surrogate in a file path, causing an opencode session to hang with a 400 error from the Anthropic API.

Labels: core (anything pertaining to core functionality of the application, opencode server stuff)
