fix false positive binary detection for files containing U+FFFD#24685
fix false positive binary detection for files containing U+FFFD#24685knQzx wants to merge 4 commits intogoogle-gemini:mainfrom
Conversation
the binary detection heuristic was not utf-8 aware - high bytes (>= 0x80) were not being validated as utf-8 sequences, which could cause files containing valid multibyte characters like U+FFFD (the unicode replacement character) to be misclassified as binary. now we validate multibyte utf-8 sequences properly and only count genuinely invalid byte sequences toward the non-printable ratio.
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue where text files containing valid Unicode characters, specifically the replacement character (U+FFFD), were being incorrectly identified as binary files. By implementing a proper UTF-8 multibyte sequence validation, the binary detection logic now accurately distinguishes between valid text encoding and actual binary data, preventing crashes and improving file handling reliability. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request enhances the isBinaryFile utility to better handle UTF-8 encoded text by introducing a validation check for multibyte sequences, ensuring characters like the Unicode replacement character (U+FFFD) are not incorrectly flagged as non-printable. New tests have been added to verify this behavior. Review feedback recommends tightening the UTF-8 start byte validation to prevent overlong encodings and including the ASCII DEL character in the non-printable character count.
| if ((b & 0xe0) === 0xc0) { | ||
| expectedLen = 2; | ||
| } else if ((b & 0xf0) === 0xe0) { | ||
| expectedLen = 3; | ||
| } else if ((b & 0xf8) === 0xf0) { | ||
| expectedLen = 4; | ||
| } else { | ||
| return 0; // Not a valid leading byte | ||
| } |
There was a problem hiding this comment.
The current UTF-8 start byte validation is too permissive. It allows invalid start bytes such as 0xC0, 0xC1 (which are always overlong) and 0xF5 through 0xF7 (which would encode values above the Unicode limit of U+10FFFF). While this is a heuristic, using stricter ranges for the start bytes improves the accuracy of the binary detection by correctly identifying these invalid sequences as non-printable.
| if ((b & 0xe0) === 0xc0) { | |
| expectedLen = 2; | |
| } else if ((b & 0xf0) === 0xe0) { | |
| expectedLen = 3; | |
| } else if ((b & 0xf8) === 0xf0) { | |
| expectedLen = 4; | |
| } else { | |
| return 0; // Not a valid leading byte | |
| } | |
| if (b >= 0xc2 && b <= 0xdf) { | |
| expectedLen = 2; | |
| } else if (b >= 0xe0 && b <= 0xef) { | |
| expectedLen = 3; | |
| } else if (b >= 0xf0 && b <= 0xf4) { | |
| expectedLen = 4; | |
| } else { | |
| return 0; // Not a valid leading byte | |
| } |
References
- When handling user input, prefer to be strict and throw an error for invalid or ambiguous cases rather than adding complex logic to support them.
| continue; | ||
| } | ||
|
|
||
| if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32)) { |
There was a problem hiding this comment.
The non-printable character check is missing the ASCII DEL character (127 or 0x7F). This character is a control character and should be counted toward the nonPrintableCount to maintain the effectiveness of the binary detection heuristic.
| if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32)) { | |
| if (buf[i] < 9 || (buf[i] > 13 && buf[i] < 32) || buf[i] === 127) { |
References
- When handling user input, prefer to be strict and throw an error for invalid or ambiguous cases rather than adding complex logic to support them.
|
I have signed the CLA |
Please see the code review comments up above |
|
hi! sorry about that, fixed both |
|
I'm not a fan of hand-rolling this logic. Can you rework this to import a third party library like isbinaryfile or istextorbinary so we don't have to implement this logic ourselves? |
|
Also, assign yourself to #24547 |
|
reworked to use the isbinaryfile library as suggested - removed all hand-rolled UTF-8 validation logic. all 89 tests pass. @scidomino |
|
You have a bunch of branch conflicts. please fix |
|
closing - this was already fixed upstream in #25297 |
Summary
Test plan
fixes #24547