
fix: validate Unicode codepoints in utf8_encode() #4

Merged
mitchellh merged 1 commit into ghostty-org:main from hobostay:fix/utf8-codepoint-validation on Mar 23, 2026

@hobostay
Contributor

Summary

  • Add validation to utf8_encode() to ensure codepoints are within valid Unicode range (U+0000 to U+10FFFF)
  • Replace invalid codepoints with Unicode replacement character U+FFFD
  • Prevents generation of malformed UTF-8 sequences

Details

The Unicode standard defines the maximum valid codepoint as U+10FFFF, and RFC 3629 restricts UTF-8 to that range. The current utf8_encode() function accepts any 32-bit value >= 0x10000 and encodes it as a 4-byte UTF-8 sequence, so any value > 0x10FFFF produces an invalid sequence.
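
To make the failure concrete, here is a minimal C sketch of an unvalidated 4-byte branch like the one described above. The name `encode4_unchecked` and the checker `is_valid_utf8_4` are hypothetical and are not taken from the project's source; the language itself is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an unvalidated 4-byte branch: it packs the
 * low 21 bits of cp into a 4-byte pattern without any range check. */
static size_t encode4_unchecked(uint32_t cp, uint8_t out[4]) {
    assert(out != NULL);
    out[0] = (uint8_t)(0xF0 | ((cp >> 18) & 0x07));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Well-formedness check for a 4-byte sequence per RFC 3629:
 * F0 is followed by 90..BF, F1..F3 by 80..BF, F4 by 80..8F,
 * and the last two bytes are continuation bytes (80..BF). */
static int is_valid_utf8_4(const uint8_t b[4]) {
    uint8_t lo = 0x80, hi = 0xBF;
    if (b[0] < 0xF0 || b[0] > 0xF4) return 0;
    if (b[0] == 0xF0) lo = 0x90; /* rejects overlong encodings */
    if (b[0] == 0xF4) hi = 0x8F; /* rejects values above U+10FFFF */
    if (b[1] < lo || b[1] > hi) return 0;
    return (b[2] & 0xC0) == 0x80 && (b[3] & 0xC0) == 0x80;
}
```

With this sketch, U+10FFFF encodes to F4 8F BF BF, which is well formed, while 0x110000 encodes to F4 90 80 80, which the RFC 3629 byte ranges reject.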

This fix validates the input codepoint and replaces out-of-range values with U+FFFD () before encoding, ensuring the output is always valid UTF-8.
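
The validation step can be sketched in C as follows. This is an illustration under assumptions, not the merged patch: the function name `utf8_encode_sketch`, its signature, and the two macros are hypothetical, since the PR text does not show the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CODEPOINT    0x10FFFFu /* highest valid Unicode codepoint */
#define REPLACEMENT_CHAR 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical encoder: writes 1 to 4 bytes into out and returns the count.
 * Only the upper bound is checked, matching what the PR describes;
 * surrogates (U+D800..U+DFFF) are deliberately not handled here. */
static size_t utf8_encode_sketch(uint32_t cp, uint8_t out[4]) {
    assert(out != NULL);

    /* The validation step: substitute U+FFFD for out-of-range input. */
    if (cp > MAX_CODEPOINT) {
        cp = REPLACEMENT_CHAR;
    }

    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* cp is now guaranteed to be in 0x10000..0x10FFFF. */
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```

Under these assumptions, any input above U+10FFFF comes out as the three bytes EF BF BD (the encoding of U+FFFD), and valid codepoints encode exactly as they would without the check.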

Test plan

  • Code compiles without warnings
  • Follows RFC 3629 UTF-8 encoding rules
  • Maintains backward compatibility for valid codepoints

🤖 Generated with Claude Code

The Unicode standard defines the maximum valid codepoint as U+10FFFF.
Codepoints above this value are invalid and produce malformed UTF-8
sequences. This patch adds validation to replace out-of-range codepoints
with the Unicode replacement character U+FFFD.

This follows RFC 3629, which restricts UTF-8 to codepoints no higher
than U+10FFFF, the largest value reachable through UTF-16 surrogate
pairs, keeping UTF-8 consistent with the Unicode standard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mitchellh mitchellh merged commit d6e707a into ghostty-org:main Mar 23, 2026
2 checks passed