Revert strip UTF-8 BOM strip by J535D165 · Pull Request #64 · MrTango/rispy

J535D165 · 2024-08-26T21:23:43Z

In this PR, I propose to revert @holub008's fix of issue #13 in PR #23. The work might no longer be needed and also negatively impacts rispy's performance.

The original issue reports that Endnote uses UTF-8-BOM for encoding. I can't reproduce that anymore, so this might no longer be true. Rispy has nowadays very good support for handling encoding, so I propose to remove support.

holub008 · 2024-08-26T22:01:18Z

Regardless of programs prepending a BOM any longer, reading in the file with utf-8-sig will strip it & is cleaner code. For that it has my vote!

However, this change is backwards incompatible on public API. I doubt clean_start is being implemented by many users, but it's a either a gamble on a minor version bump (which will silently succeed without the expected code running in some cases!) or requires a major version bump. I suspect the performance rationale for making this change is weak, since calling lstrip on one line of the file is orders of magnitude faster than the entire parse.

PeterLombaers · 2024-08-27T08:08:06Z

You could do something like:

if lines:
    lines[0] = self.clean_start(lines[0])
for line in lines:
...

Then the BOM character still gets stripped and there definitely won't be an impact on perfomance. (I can imagine that Python is smart enough to make this change in the background anyway.)

J535D165 · 2024-08-27T13:25:26Z

Good suggestions. I propose a bigger refactor in #66, and therefore I skipped the idea of @PeterLombaers. I think it's also time for some breaking changes now, so I'm not to worried about that (we are still in 0.x).

shapiromatron

Looks good. I had to do a little research for myself about encodings, so wanted to drop this link in the review to understand what utf-8-sig does:

https://docs.python.org/3/library/codecs.html

As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.
Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to
LATIN SMALL LETTER I WITH DIAERESIS
RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
INVERTED QUESTION MARK
in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

Revert strip UTF-8 BOM strip

c0e8614

J535D165 added 2 commits August 27, 2024 14:28

Fix test

05426a6

Fix doctest

ed804a1

J535D165 mentioned this pull request Aug 27, 2024

Improve performance of rispy's parser and refactor parser #66

Merged

shapiromatron self-assigned this Mar 26, 2025

shapiromatron approved these changes Apr 11, 2025

View reviewed changes

shapiromatron merged commit 7e38b93 into MrTango:main Apr 16, 2025

J535D165 deleted the revert-bom-removal branch April 17, 2025 15:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert strip UTF-8 BOM strip#64

Revert strip UTF-8 BOM strip#64
shapiromatron merged 3 commits intoMrTango:mainfrom
J535D165:revert-bom-removal

J535D165 commented Aug 26, 2024

Uh oh!

holub008 commented Aug 26, 2024

Uh oh!

PeterLombaers commented Aug 27, 2024

Uh oh!

J535D165 commented Aug 27, 2024

Uh oh!

shapiromatron left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

J535D165 commented Aug 26, 2024

Uh oh!

holub008 commented Aug 26, 2024

Uh oh!

PeterLombaers commented Aug 27, 2024

Uh oh!

J535D165 commented Aug 27, 2024

Uh oh!

shapiromatron left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants