fix false positive binary detection for files containing U+FFFD by knQzx · Pull Request #24685 · google-gemini/gemini-cli

knQzx · 2026-04-04T15:20:39Z

Summary

Files containing the Unicode replacement character (U+FFFD) are incorrectly flagged as binary, causing a crash when reading valid source files (e.g. Rust files that use this character intentionally).
Replaced the naive high-byte heuristic with proper UTF-8 multibyte sequence validation: valid sequences (including U+FFFD, encoded as EF BF BD) are now skipped, and only truly invalid byte sequences count toward the non-printable ratio.

Test plan

added test: file with literal U+FFFD chars → detected as text
added test: file with CJK + emoji + U+FFFD → detected as text
added test: file with invalid raw UTF-8 bytes → still detected as binary
all 89 existing tests pass

The binary detection heuristic was not UTF-8 aware: high bytes (>= 0x80) were not validated as UTF-8 sequences, so files containing valid multibyte characters like U+FFFD (the Unicode replacement character) could be misclassified as binary. Multibyte UTF-8 sequences are now validated properly, and only genuinely invalid byte sequences count toward the non-printable ratio.
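
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `utf8SequenceLength` and `looksBinary`, the 0.3 threshold, and the early return on a NUL byte are all assumptions made for the example.

```typescript
// Hypothetical sketch of a UTF-8-aware binary heuristic (names are illustrative).

/** Length of a structurally valid UTF-8 sequence starting at `i`, or 0 if invalid. */
function utf8SequenceLength(buf: Uint8Array, i: number): number {
  const b = buf[i];
  let expectedLen: number;
  if ((b & 0xe0) === 0xc0) {
    expectedLen = 2;
  } else if ((b & 0xf0) === 0xe0) {
    expectedLen = 3;
  } else if ((b & 0xf8) === 0xf0) {
    expectedLen = 4;
  } else {
    return 0; // Not a leading byte (stray continuation byte or 0xF8..0xFF)
  }
  // A sequence cut off by the end of the buffer is treated as invalid here.
  if (i + expectedLen > buf.length) return 0;
  // Every following byte must be a continuation byte of the form 10xxxxxx.
  for (let k = 1; k < expectedLen; k++) {
    if ((buf[i + k] & 0xc0) !== 0x80) return 0;
  }
  return expectedLen;
}

/** Returns true when the sampled bytes look binary. The threshold is an assumption. */
function looksBinary(buf: Uint8Array, threshold = 0.3): boolean {
  if (buf.length === 0) return false;
  let nonPrintable = 0;
  let i = 0;
  while (i < buf.length) {
    const b = buf[i];
    if (b === 0) return true; // Assumption: a NUL byte is treated as binary outright
    if (b >= 0x80) {
      const len = utf8SequenceLength(buf, i);
      if (len > 0) {
        i += len; // Valid multibyte character: skip it without counting
        continue;
      }
      nonPrintable++; // Invalid high byte counts toward the ratio
      i++;
      continue;
    }
    if (b < 9 || (b > 13 && b < 32)) nonPrintable++; // ASCII control characters
    i++;
  }
  return nonPrintable / buf.length > threshold;
}

// "a", U+FFFD (EF BF BD), "b": every high byte belongs to a valid sequence.
const sampleWithReplacementChar = new Uint8Array([0x61, 0xef, 0xbf, 0xbd, 0x62]);
console.log(looksBinary(sampleWithReplacementChar)); // false
```

Under these assumptions, a buffer is classified by the share of bytes that are neither printable ASCII nor part of a valid multibyte sequence, so U+FFFD no longer inflates the count.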

google-cla · 2026-04-04T15:20:44Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

gemini-code-assist · 2026-04-04T15:20:47Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where text files containing valid Unicode characters, specifically the replacement character (U+FFFD), were being incorrectly identified as binary files. By implementing a proper UTF-8 multibyte sequence validation, the binary detection logic now accurately distinguishes between valid text encoding and actual binary data, preventing crashes and improving file handling reliability.

Highlights

Improved Binary Detection Heuristic: Replaced the naive high-byte check with a robust UTF-8 multibyte sequence validation to correctly identify text files.
Fixed False Positives: Files containing the Unicode replacement character (U+FFFD) or other valid UTF-8 sequences are no longer incorrectly flagged as binary.
Enhanced Test Coverage: Added new test cases to verify that U+FFFD, CJK characters, and emojis are treated as text, while invalid byte sequences remain correctly identified as binary.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request enhances the isBinaryFile utility to better handle UTF-8 encoded text by introducing a validation check for multibyte sequences, ensuring characters like the Unicode replacement character (U+FFFD) are not incorrectly flagged as non-printable. New tests have been added to verify this behavior. Review feedback recommends tightening the UTF-8 start byte validation to prevent overlong encodings and including the ASCII DEL character in the non-printable character count.

gemini-code-assist · 2026-04-04T15:23:19Z

+  if ((b & 0xe0) === 0xc0) {
+    expectedLen = 2;
+  } else if ((b & 0xf0) === 0xe0) {
+    expectedLen = 3;
+  } else if ((b & 0xf8) === 0xf0) {
+    expectedLen = 4;
+  } else {
+    return 0; // Not a valid leading byte
+  }


The current UTF-8 start byte validation is too permissive. It allows invalid start bytes such as 0xC0, 0xC1 (which are always overlong) and 0xF5 through 0xF7 (which would encode values above the Unicode limit of U+10FFFF). While this is a heuristic, using stricter ranges for the start bytes improves the accuracy of the binary detection by correctly identifying these invalid sequences as non-printable.

Suggested change

-  if ((b & 0xe0) === 0xc0) {
-    expectedLen = 2;
-  } else if ((b & 0xf0) === 0xe0) {
-    expectedLen = 3;
-  } else if ((b & 0xf8) === 0xf0) {
-    expectedLen = 4;
-  } else {
-    return 0; // Not a valid leading byte
-  }
+  if (b >= 0xc2 && b <= 0xdf) {
+    expectedLen = 2;
+  } else if (b >= 0xe0 && b <= 0xef) {
+    expectedLen = 3;
+  } else if (b >= 0xf0 && b <= 0xf4) {
+    expectedLen = 4;
+  } else {
+    return 0; // Not a valid leading byte
+  }

References

When handling user input, prefer to be strict and throw an error for invalid or ambiguous cases rather than adding complex logic to support them.
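
To make the difference between the two checks concrete, here is a small side-by-side sketch. The function names `permissiveLeadLength` and `strictLeadLength` are hypothetical and exist only for this comparison. The byte ranges are the ones quoted in the review comment.

```typescript
// Hypothetical helpers comparing the mask-based and range-based lead-byte checks.

/** Mask-based check: accepts any byte matching the 110xxxxx / 1110xxxx / 11110xxx patterns. */
function permissiveLeadLength(b: number): number {
  if ((b & 0xe0) === 0xc0) return 2;
  if ((b & 0xf0) === 0xe0) return 3;
  if ((b & 0xf8) === 0xf0) return 4;
  return 0;
}

/** Range-based check: rejects 0xC0, 0xC1 and 0xF5..0xF7 as leading bytes. */
function strictLeadLength(b: number): number {
  if (b >= 0xc2 && b <= 0xdf) return 2;
  if (b >= 0xe0 && b <= 0xef) return 3;
  if (b >= 0xf0 && b <= 0xf4) return 4;
  return 0;
}

console.log(permissiveLeadLength(0xc0), strictLeadLength(0xc0)); // 2 0
console.log(permissiveLeadLength(0xf5), strictLeadLength(0xf5)); // 4 0
console.log(permissiveLeadLength(0xef), strictLeadLength(0xef)); // 3 3
```

Tightening the leading byte alone does not reject every ill-formed sequence. For example, `E0 80 80` (overlong) and `ED A0 80` (a surrogate) still pass a check that only verifies the continuation-byte pattern. Rejecting those would need per-lead-byte limits on the second byte, which the review comment does not ask for.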

gemini-code-assist · 2026-04-04T15:23:20Z

+        continue;
+      }
+
      if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32)) {


The non-printable character check is missing the ASCII DEL character (127 or 0x7F). This character is a control character and should be counted toward the nonPrintableCount to maintain the effectiveness of the binary detection heuristic.

Suggested change

-      if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32)) {
+      if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32) || buf[i] === 127) {

References

When handling user input, prefer to be strict and throw an error for invalid or ambiguous cases rather than adding complex logic to support them.
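
As a quick illustration of what the suggested condition covers, the predicate below isolates it into a standalone function. The name `isNonPrintableAscii` is hypothetical. The condition itself is copied from the suggestion above.

```typescript
// Hypothetical standalone version of the suggested control-character check.
function isNonPrintableAscii(b: number): boolean {
  // Control characters below TAB (9), those between CR (13) and SPACE (32), and DEL (127).
  return b < 9 || (b > 13 && b < 32) || b === 127;
}

console.log(isNonPrintableAscii(9));   // false (TAB)
console.log(isNonPrintableAscii(10));  // false (LF)
console.log(isNonPrintableAscii(27));  // true  (ESC)
console.log(isNonPrintableAscii(126)); // false ("~")
console.log(isNonPrintableAscii(127)); // true  (DEL)
```

Bytes 9 through 13 (TAB, LF, VT, FF, CR) stay printable under this check. Bytes at or above 0x80 return false here too, so they need separate handling.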

knQzx · 2026-04-04T15:41:57Z

I have signed the CLA

bdmorgan · 2026-04-04T15:43:47Z

> I have signed the CLA

Please see the code review comments up above

knQzx · 2026-04-04T15:49:48Z

hi! sorry about that, fixed both

scidomino · 2026-04-11T19:37:46Z

I'm not a fan of hand-rolling this logic. Can you rework this to import a third party library like isbinaryfile or istextorbinary so we don't have to implement this logic ourselves?

scidomino · 2026-04-11T19:43:22Z

Also, assign yourself to #24547

knQzx · 2026-04-16T15:51:14Z

reworked to use the isbinaryfile library as suggested - removed all hand-rolled UTF-8 validation logic. all 89 tests pass. @scidomino

scidomino · 2026-04-16T16:29:46Z

You have a bunch of branch conflicts. Please fix them.

knQzx · 2026-04-16T17:49:49Z

closing - this was already fixed upstream in #25297

knQzx requested a review from a team as a code owner April 4, 2026 15:20

gemini-code-assist Bot reviewed Apr 4, 2026

gemini-cli Bot added the area/core Issues related to User Interface, OS Support, Core Functionality label Apr 4, 2026

retrigger ci

d89b4a2

tighten utf-8 validation and add DEL to non-printable check

2fc7519

github-actions Bot mentioned this pull request Apr 5, 2026

📊 AI CLI Tools Community Daily Report 2026-04-05 gsscsd/big_model_radar#136

Open

gemini-cli Bot added the priority/p2 Important but can be addressed in a future release. label Apr 10, 2026

github-actions Bot mentioned this pull request Apr 11, 2026

📊 AI CLI Tools Community Daily Report 2026-04-11 gsscsd/big_model_radar#167

Open

nazmulidris mentioned this pull request Apr 11, 2026

'U+FFFD' character in file broken by replace tool #24547

Closed

gemini-cli Bot added the help wanted We will accept PRs from all issues marked as "help wanted". Thanks for your support! label Apr 11, 2026

github-actions Bot mentioned this pull request Apr 12, 2026

📊 AI CLI Tools Community Daily Report 2026-04-12 gsscsd/big_model_radar#172

Open

replace hand-rolled binary detection with isbinaryfile library

1f5b0ed

knQzx requested a review from a team as a code owner April 16, 2026 15:50

knQzx closed this Apr 16, 2026
