fix false positive binary detection for files containing U+FFFD #24685

Closed
knQzx wants to merge 4 commits into google-gemini:main from knQzx:fix-unicode-fffd-crash

@knQzx knQzx commented Apr 4, 2026

Summary

  • files containing the unicode replacement character (U+FFFD) are incorrectly flagged as binary, causing a crash when trying to read valid source files (e.g. rust files that use this char intentionally)
  • replaced the naive high-byte heuristic with proper UTF-8 multibyte sequence validation: valid sequences (including U+FFFD encoded as EF BF BD) are now skipped, and only truly invalid byte sequences count toward the non-printable ratio
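
To make the byte-level claim concrete, here is a small sketch of how U+FFFD becomes EF BF BD, and why a check that treats every byte >= 0x80 as suspicious would count all three bytes. The PR does not show the earlier heuristic, so `countHighBytes` is only an assumed stand-in for it, and `encodeThreeByteUtf8` is a hypothetical helper written for this illustration.

```typescript
// Encode a code point in U+0800..U+FFFF as a three-byte UTF-8 sequence.
// Hypothetical helper for illustration; surrogates are not rejected here.
function encodeThreeByteUtf8(codePoint: number): number[] {
  if (codePoint < 0x800 || codePoint > 0xffff) {
    throw new RangeError('expected a code point in U+0800..U+FFFF');
  }
  return [
    0xe0 | (codePoint >> 12), // leading byte: 1110xxxx
    0x80 | ((codePoint >> 6) & 0x3f), // continuation: 10xxxxxx
    0x80 | (codePoint & 0x3f), // continuation: 10xxxxxx
  ];
}

// ASSUMPTION: a "naive" check that flags every byte >= 0x80.
function countHighBytes(bytes: number[]): number {
  return bytes.filter((b) => b >= 0x80).length;
}

const replacementChar = encodeThreeByteUtf8(0xfffd);
console.log(replacementChar.map((b) => b.toString(16)).join(' ')); // ef bf bd
console.log(countHighBytes(replacementChar)); // 3
```

Every byte of the sequence has its high bit set, so a file with many such characters can cross a ratio threshold even though it is valid UTF-8 text.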

Test plan

  • added test: file with literal U+FFFD chars → detected as text
  • added test: file with CJK + emoji + U+FFFD → detected as text
  • added test: file with invalid raw UTF-8 bytes → still detected as binary
  • all 89 existing tests pass

fixes #24547

the binary detection heuristic was not utf-8 aware - high bytes
(>= 0x80) were not being validated as utf-8 sequences, which could
cause files containing valid multibyte characters like U+FFFD (the
unicode replacement character) to be misclassified as binary.

now we validate multibyte utf-8 sequences properly and only count
genuinely invalid byte sequences toward the non-printable ratio.
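
A rough sketch of the approach described in this commit message, not the PR's actual code: well-formed multibyte sequences are skipped, and only bytes that fail validation count toward the ratio. The names `utf8SequenceLength` and `looksBinary`, the 0.3 threshold, and the early return on a NUL byte are all assumptions made for illustration.

```typescript
// Returns the length (2-4) of a UTF-8 multibyte sequence starting at index i,
// or 0 if the bytes there do not form one. This checks the leading-byte range
// and the 10xxxxxx continuation pattern only; it does not reject every
// overlong or surrogate encoding.
function utf8SequenceLength(buf: Uint8Array, i: number): number {
  const b = buf[i];
  let len = 0;
  if (b >= 0xc2 && b <= 0xdf) len = 2;
  else if (b >= 0xe0 && b <= 0xef) len = 3;
  else if (b >= 0xf0 && b <= 0xf4) len = 4;
  else return 0;
  // A sequence cut off by the end of the sample is treated as invalid.
  if (i + len > buf.length) return 0;
  for (let k = 1; k < len; k++) {
    if ((buf[i + k] & 0xc0) !== 0x80) return 0;
  }
  return len;
}

// ASSUMPTION: threshold and NUL handling are illustrative defaults.
function looksBinary(buf: Uint8Array, threshold = 0.3): boolean {
  if (buf.length === 0) return false;
  let nonPrintable = 0;
  let i = 0;
  while (i < buf.length) {
    const b = buf[i];
    if (b === 0) return true; // NUL byte: treated as a strong binary signal
    if (b >= 0x80) {
      const len = utf8SequenceLength(buf, i);
      if (len > 0) {
        i += len; // valid multibyte character: skip it without counting
        continue;
      }
      nonPrintable++; // stray or malformed high byte
    } else if (b < 9 || (b > 13 && b < 32) || b === 127) {
      nonPrintable++; // ASCII control character
    }
    i++;
  }
  return nonPrintable / buf.length > threshold;
}

// "a", U+FFFD, "b": valid text
console.log(looksBinary(new Uint8Array([0x61, 0xef, 0xbf, 0xbd, 0x62]))); // false
// four malformed high bytes followed by "A"
console.log(looksBinary(new Uint8Array([0xff, 0xfe, 0x80, 0x81, 0x41]))); // true
```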
@knQzx knQzx requested a review from a team as a code owner April 4, 2026 15:20

google-cla Bot commented Apr 4, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where text files containing valid Unicode characters, specifically the replacement character (U+FFFD), were being incorrectly identified as binary files. By implementing a proper UTF-8 multibyte sequence validation, the binary detection logic now accurately distinguishes between valid text encoding and actual binary data, preventing crashes and improving file handling reliability.

Highlights

  • Improved Binary Detection Heuristic: Replaced the naive high-byte check with a robust UTF-8 multibyte sequence validation to correctly identify text files.
  • Fixed False Positives: Files containing the Unicode replacement character (U+FFFD) or other valid UTF-8 sequences are no longer incorrectly flagged as binary.
  • Enhanced Test Coverage: Added new test cases to verify that U+FFFD, CJK characters, and emojis are treated as text, while invalid byte sequences remain correctly identified as binary.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the isBinaryFile utility to better handle UTF-8 encoded text by introducing a validation check for multibyte sequences, ensuring characters like the Unicode replacement character (U+FFFD) are not incorrectly flagged as non-printable. New tests have been added to verify this behavior. Review feedback recommends tightening the UTF-8 start byte validation to prevent overlong encodings and including the ASCII DEL character in the non-printable character count.

Comment thread packages/core/src/utils/fileUtils.ts Outdated
Comment on lines +291 to +299
    if ((b & 0xe0) === 0xc0) {
      expectedLen = 2;
    } else if ((b & 0xf0) === 0xe0) {
      expectedLen = 3;
    } else if ((b & 0xf8) === 0xf0) {
      expectedLen = 4;
    } else {
      return 0; // Not a valid leading byte
    }

high

The current UTF-8 start byte validation is too permissive. It allows invalid start bytes such as 0xC0, 0xC1 (which are always overlong) and 0xF5 through 0xF7 (which would encode values above the Unicode limit of U+10FFFF). While this is a heuristic, using stricter ranges for the start bytes improves the accuracy of the binary detection by correctly identifying these invalid sequences as non-printable.

Suggested change
-   if ((b & 0xe0) === 0xc0) {
-     expectedLen = 2;
-   } else if ((b & 0xf0) === 0xe0) {
-     expectedLen = 3;
-   } else if ((b & 0xf8) === 0xf0) {
-     expectedLen = 4;
-   } else {
-     return 0; // Not a valid leading byte
-   }
+   if (b >= 0xc2 && b <= 0xdf) {
+     expectedLen = 2;
+   } else if (b >= 0xe0 && b <= 0xef) {
+     expectedLen = 3;
+   } else if (b >= 0xf0 && b <= 0xf4) {
+     expectedLen = 4;
+   } else {
+     return 0; // Not a valid leading byte
+   }
References
  1. When handling user input, prefer to be strict and throw an error for invalid or ambiguous cases rather than adding complex logic to support them.
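
To see exactly which leading bytes this review comment is about, the two classifiers can be compared over every high byte. This is only an illustrative sketch: `leadLengthByMask` and `leadLengthByRange` are hypothetical names that wrap the two conditions quoted in the review.

```typescript
// Bit-mask classification, as in the code under review.
function leadLengthByMask(b: number): number {
  if ((b & 0xe0) === 0xc0) return 2; // 110xxxxx: 0xC0..0xDF
  if ((b & 0xf0) === 0xe0) return 3; // 1110xxxx: 0xE0..0xEF
  if ((b & 0xf8) === 0xf0) return 4; // 11110xxx: 0xF0..0xF7
  return 0;
}

// Range classification, as in the suggested change.
function leadLengthByRange(b: number): number {
  if (b >= 0xc2 && b <= 0xdf) return 2;
  if (b >= 0xe0 && b <= 0xef) return 3;
  if (b >= 0xf0 && b <= 0xf4) return 4;
  return 0;
}

// Collect every byte on which the two classifiers disagree.
const disagreements: number[] = [];
for (let b = 0x80; b <= 0xff; b++) {
  if (leadLengthByMask(b) !== leadLengthByRange(b)) {
    disagreements.push(b);
  }
}
console.log(disagreements.map((b) => '0x' + b.toString(16)).join(', '));
// 0xc0, 0xc1, 0xf5, 0xf6, 0xf7
```

The mask version accepts those five bytes as sequence starts while the range version rejects them, which matches the bytes the comment names: 0xC0 and 0xC1 can only begin overlong two-byte forms, and 0xF5 through 0xF7 would encode values above U+10FFFF.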

Comment thread packages/core/src/utils/fileUtils.ts Outdated
      continue;
    }

    if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32)) {

high

The non-printable character check is missing the ASCII DEL character (127 or 0x7F). This character is a control character and should be counted toward the nonPrintableCount to maintain the effectiveness of the binary detection heuristic.

Suggested change
-   if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32)) {
+   if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32) || buf[i] === 127) {
References
  1. When handling user input, prefer to be strict and throw an error for invalid or ambiguous cases rather than adding complex logic to support them.
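
As a quick illustration of the suggested condition, it can be pulled into a predicate and probed at the boundaries. `isControlByte` is a hypothetical name; the PR keeps the check inline.

```typescript
// True for ASCII control bytes that the heuristic counts as non-printable.
// Tab (9), LF (10), VT (11), FF (12) and CR (13) are deliberately excluded.
function isControlByte(b: number): boolean {
  return b < 9 || (b > 13 && b < 32) || b === 127;
}

console.log(isControlByte(0x7f)); // true  (DEL, newly counted)
console.log(isControlByte(0x09)); // false (tab)
console.log(isControlByte(0x1b)); // true  (ESC)
console.log(isControlByte(0x20)); // false (space)
```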

@gemini-cli gemini-cli Bot added the area/core Issues related to User Interface, OS Support, Core Functionality label Apr 4, 2026

knQzx commented Apr 4, 2026

I have signed the CLA

bdmorgan commented Apr 4, 2026

> I have signed the CLA

Please see the code review comments up above

knQzx commented Apr 4, 2026

hi! sorry about that, fixed both

scidomino (Collaborator) commented

I'm not a fan of hand-rolling this logic. Can you rework this to import a third party library like isbinaryfile or istextorbinary so we don't have to implement this logic ourselves?

scidomino (Collaborator) commented

Also, assign yourself to #24547

@gemini-cli gemini-cli Bot added the help wanted We will accept PRs from all issues marked as "help wanted". Thanks for your support! label Apr 11, 2026
@knQzx knQzx requested a review from a team as a code owner April 16, 2026 15:50

knQzx commented Apr 16, 2026

reworked to use the isbinaryfile library as suggested - removed all hand-rolled UTF-8 validation logic. all 89 tests pass. @scidomino

scidomino (Collaborator) commented

You have a bunch of branch conflicts. please fix

knQzx commented Apr 16, 2026

closing - this was already fixed upstream in #25297

@knQzx knQzx closed this Apr 16, 2026

Labels

  • area/core: Issues related to User Interface, OS Support, Core Functionality
  • help wanted: We will accept PRs from all issues marked as "help wanted". Thanks for your support!
  • priority/p2: Important but can be addressed in a future release.

Development

Successfully merging this pull request may close these issues.

'U+FFFD' character in file broken by replace tool

3 participants