bug: sanitizeSurrogates() misses JSON-escaped lone surrogates (\uD800 ASCII escape) causing Anthropic API 400 errors

## Bug Description

OpenCode's `sanitizeSurrogates()` in `packages/opencode/src/util/sanitize-surrogates.ts` (added in the custom `fetch` wrapper in `provider.ts`) fails to catch lone surrogates that have already been serialized by `JSON.stringify()` as ASCII escape sequences.

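**Root cause**: `JSON.stringify()` encodes a lone surrogate code unit as a 6-character ASCII escape sequence (e.g., `\ud82c`; well-formed `JSON.stringify` emits lowercase hex). At that point the string contains only ASCII characters, so `String.prototype.isWellFormed()` returns `true`: there are no literal surrogate code units left to find. The JSON text still denotes an unpaired surrogate, though. RFC 8259 (section 8.2) warns that such strings lead to unpredictable behavior in receiving software, and strict parsers refuse them. Anthropic's API backend (`yyjson`/`serde_json`) rejects these with: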

```
"no low surrogate in string"
```

This causes the entire session to hang with a 400 error from the Anthropic API.

## Reproduction

```typescript
import { sanitizeSurrogates } from "./packages/opencode/src/util/sanitize-surrogates"

// JSON.stringify encodes the lone surrogate as an ASCII escape (lowercase hex)
const serialized = JSON.stringify({ text: "\uD82C" })
// serialized is now the text {"text":"\ud82c"}

// isWellFormed() returns true: the string holds no literal surrogate code units
console.log(serialized.isWellFormed()) // true

// sanitizeSurrogates passes it through unchanged
const result = sanitizeSurrogates(serialized)
// result still contains \ud82c, which the Anthropic API rejects
```
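To confirm the condition without importing anything from the repository, a small detector can walk the escape sequences in the serialized text. This is a minimal sketch, not OpenCode code: `findLoneSurrogateEscapes` is a hypothetical helper name, and it assumes the input is otherwise valid JSON (so a backslash can only appear inside a string).

```typescript
// Hypothetical helper (not part of OpenCode): returns the offsets of
// unpaired \uD800-\uDFFF escapes in already-serialized JSON text.
function findLoneSurrogateEscapes(jsonText: string): number[] {
  // Assumes valid JSON, where a backslash only occurs inside strings.
  // Each match consumes one whole escape, so an escaped backslash followed
  // by the literal text "uD800" is never mistaken for a \uD800 escape.
  const escape = /\\(?:u([0-9a-fA-F]{4})|[^u])/g
  const offsets: number[] = []
  let pendingHigh = -1 // offset of a high surrogate still waiting for its low half
  let m: RegExpExecArray | null
  while ((m = escape.exec(jsonText)) !== null) {
    const hex = m[1]
    const code = hex === undefined ? -1 : parseInt(hex, 16)
    const isHigh = code >= 0xd800 && code <= 0xdbff
    const isLow = code >= 0xdc00 && code <= 0xdfff
    const adjacent = pendingHigh >= 0 && m.index === pendingHigh + 6
    if (isLow && adjacent) {
      pendingHigh = -1 // valid high+low pair
      continue
    }
    if (pendingHigh >= 0) offsets.push(pendingHigh) // high with no low after it
    pendingHigh = isHigh ? m.index : -1
    if (isLow) offsets.push(m.index) // low with no high before it
  }
  if (pendingHigh >= 0) offsets.push(pendingHigh)
  return offsets
}

const hits = findLoneSurrogateEscapes(JSON.stringify({ text: "\uD82C" }))
console.log(hits) // [ 9 ]: the escape starts at offset 9 of {"text":"\ud82c"}
```

A non-empty result on a request body means a strict JSON parser on the receiving end may reject it, even though `isWellFormed()` on the same body returns `true`.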

## Impact

Any tool that produces output containing lone Unicode surrogates (binary file reads, terminal output with non-UTF-8 data, certain East Asian characters in some encodings) will cause the session to fail with a cryptic 400 error.

Reported in: [anthropics/claude-code#1709](https://github.com/anthropics/claude-code/issues/1709), [anthropics/claude-code#1832](https://github.com/anthropics/claude-code/issues/1832)

## Proposed Fix

Add a JSON-lexer pass **before** the `isWellFormed()` check that finds `\uDxxx` escape sequences and replaces lone ones with `\uFFFD`:

```typescript
// Parses the four hex digits at s[idx..idx+3]; returns -1 if any is not a hex digit.
function hex4ToInt(s: string, idx: number): number {
  let x = 0
  for (let i = 0; i < 4; i++) {
    const c = s.charCodeAt(idx + i)
    let v: number
    if (c >= 0x30 && c <= 0x39) v = c - 0x30
    else if (c >= 0x41 && c <= 0x46) v = c - 0x41 + 10
    else if (c >= 0x61 && c <= 0x66) v = c - 0x61 + 10
    else return -1
    x = (x << 4) | v
  }
  return x
}

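// Replaces unpaired \uD800-\uDFFF escapes inside JSON strings with \uFFFD,
// leaving valid high+low pairs and all other escapes untouched.
function fixJsonSurrogateEscapes(jsonText: string): string {
  // Fast-path: no surrogate escape sequences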
  if (jsonText.indexOf("\\uD") === -1 && jsonText.indexOf("\\ud") === -1) {
    return jsonText
  }

  let inString = false
  let out: string[] | null = null
  let last = 0

  for (let i = 0; i < jsonText.length; i++) {
    const ch = jsonText.charCodeAt(i)
    if (!inString) {
      if (ch === 0x22 /* " */) inString = true
      continue
    }
    if (ch === 0x22) { inString = false; continue }
    if (ch !== 0x5c /* \ */) continue
    const next = jsonText.charCodeAt(i + 1)
    if (next !== 0x75 /* u */) { i += 1; continue }
    if (i + 5 >= jsonText.length) { i += 1; continue }

    const code = hex4ToInt(jsonText, i + 2)
    if (code < 0) { i += 1; continue }

    const isHigh = code >= 0xd800 && code <= 0xdbff
    const isLow  = code >= 0xdc00 && code <= 0xdfff
    if (!isHigh && !isLow) { i += 5; continue }

    if (isHigh) {
      const j = i + 6
      if (j + 5 < jsonText.length &&
          jsonText.charCodeAt(j) === 0x5c &&
          jsonText.charCodeAt(j + 1) === 0x75) {
        const code2 = hex4ToInt(jsonText, j + 2)
        if (code2 >= 0xdc00 && code2 <= 0xdfff) { i += 11; continue }  // valid pair
      }
    }

    // Lone surrogate — replace with \uFFFD
    if (!out) out = []
    out.push(jsonText.slice(last, i))
    out.push("\\uFFFD")
    last = i + 6
    i += 5
  }

  if (!out) return jsonText
  out.push(jsonText.slice(last))
  return out.join("")
}

export function sanitizeSurrogates(s: string): string {
  if (typeof s !== "string" || s.length === 0) return s
  s = fixJsonSurrogateEscapes(s)   // ← add this
  if (typeof s.isWellFormed === "function" && s.isWellFormed()) return s
  if (typeof s.toWellFormed === "function") return s.toWellFormed()
  return s.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, "\uFFFD")
}
```
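Where the caller controls serialization, a complementary option is to scrub string values before they reach `JSON.stringify()`, so that no lone-surrogate escape is ever produced. The sketch below is an assumption-laden illustration rather than a replacement for the lexer pass above: `stringifyWellFormed` and `replaceLoneSurrogates` are hypothetical names, and the approach does not help when the body arrives already serialized (as in the `fetch` wrapper). It also leaves object keys untouched, since a replacer only sees values.

```typescript
// Hypothetical helpers (not part of OpenCode).
// Matches either a valid surrogate pair or any single surrogate code unit.
// Pairs are tried first, so a single-unit match is always a lone surrogate.
const SURROGATES = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g

function replaceLoneSurrogates(s: string): string {
  // Keep two-unit matches (valid pairs); replace one-unit matches with U+FFFD.
  return s.replace(SURROGATES, (m) => (m.length === 2 ? m : "\uFFFD"))
}

function stringifyWellFormed(value: unknown): string {
  // The replacer runs on every value in the tree, including array elements.
  return JSON.stringify(value, (_key: string, v: unknown) =>
    typeof v === "string" ? replaceLoneSurrogates(v) : v,
  )
}

const body = stringifyWellFormed({ text: "path/\uD82C/file" })
console.log(body.includes("\\ud82c")) // false: the escape is never emitted
```

The trade-off is that this only covers payloads serialized through this one function, whereas the lexer pass in `sanitizeSurrogates()` catches every body that goes through the wrapper.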

A PR with tests will follow.

---

> **Note**: Reported and fix developed with AI assistance (Claude Code). The bug was encountered in production when a Korean character was encoded as a lone surrogate in a file path, causing an `opencode` session to hang with a 400 error from the Anthropic API.
