fix(security): preserve leading BOM in strip_dangerous #372

Merged
danielmeppiel merged 3 commits into microsoft:main from dadavidtseng:fix/preserve-leading-bom-in-strip-dangerous on Mar 20, 2026

@dadavidtseng
Contributor

Summary

ContentScanner.strip_dangerous() was stripping all BOM characters (U+FEFF), including a leading BOM at position 0. This contradicts its documented contract:

Info-level characters (emoji selectors, non-breaking spaces, unusual whitespace) are preserved — they are legitimate and stripping them would break content.

scan_text() correctly classifies a leading BOM as "info" severity (standard practice for UTF-8 files), but strip_dangerous() stripped it unconditionally. As a result, running apm audit --strip on a file with a legitimate leading BOM incorrectly removed the BOM.
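
To make the position-dependent classification concrete, here is a minimal sketch of how U+FEFF could be graded by index. The function name `bom_findings` and its return shape are hypothetical and are not taken from `scan_text()`; only the "info at position 0, warning elsewhere" rule comes from this PR.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark


def bom_findings(text: str) -> list[tuple[int, str]]:
    """Return (index, severity) for every U+FEFF found in text.

    Hypothetical stand-in for the BOM portion of a scanner: a BOM at
    index 0 is graded "info", a BOM at any other index is "warning".
    """
    return [
        (i, "info" if i == 0 else "warning")
        for i, ch in enumerate(text)
        if ch == BOM
    ]


sample = BOM + "title" + BOM + "body"
print(bom_findings(sample))  # [(0, 'info'), (6, 'warning')]
```

A stripper that honors the documented contract would then need to leave the index-0 finding alone and remove only the one at index 6.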

Fix

  • Check whether the BOM is at position 0 before deciding to strip it
  • Leading BOM (info-level) is now preserved; mid-file BOMs (warning-level) are still stripped
  • Updated the corresponding test to assert the leading BOM is preserved
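
The behavior described in the bullets above can be sketched as follows. This is an illustration under assumptions, not the code from content_scanner.py: `strip_bom_keep_leading` is a hypothetical helper that handles only U+FEFF and ignores every other character class the real method deals with.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark


def strip_bom_keep_leading(text: str) -> str:
    """Drop every U+FEFF except one sitting at index 0.

    Hypothetical helper: it models only the BOM rule from this PR.
    """
    return "".join(
        ch for i, ch in enumerate(text) if ch != BOM or i == 0
    )


# Leading BOM survives, mid-file BOM is removed.
assert strip_bom_keep_leading(BOM + "a" + BOM + "b") == BOM + "ab"
# Without a leading BOM, every BOM in the text counts as mid-file.
assert strip_bom_keep_leading("a" + BOM + "b") == "ab"
```

Note that only the character at index 0 is treated as leading, so a second BOM immediately after the first is still removed.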

Files changed

  • src/apm_cli/security/content_scanner.py — 3-line fix in strip_dangerous()
  • tests/unit/test_content_scanner.py — updated test to match corrected behavior

Test plan

  • All 78 content scanner tests pass
  • All 41 audit command tests pass
  • Leading BOM at position 0 is preserved (info-level)
  • Mid-file BOM is still stripped (warning-level)

strip_dangerous() was stripping all BOM characters (U+FEFF), including
a leading BOM at position 0. This contradicts its documented contract
which states that info-level characters are preserved — and scan_text()
classifies a leading BOM as info severity since it is standard practice
for UTF-8 files.

The fix checks whether the BOM is at position 0 before deciding to
strip it. Mid-file BOMs (warning-level) are still stripped as before.

Updated the corresponding test to assert the leading BOM is preserved.
Copilot AI review requested due to automatic review settings March 19, 2026 16:42
@dadavidtseng
Copy link

Contributor Author

@microsoft-github-policy-service agree

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes ContentScanner.strip_dangerous() to preserve a legitimate leading BOM (U+FEFF) while still stripping mid-file BOMs, aligning behavior with the documented “info-level characters are preserved” contract.

Changes:

  • Update strip_dangerous() to keep BOM at index 0 and remove BOMs elsewhere.
  • Update the unit test to assert a leading BOM remains unchanged.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/apm_cli/security/content_scanner.py | Adjust BOM handling in strip_dangerous() to preserve the leading BOM only. |
| tests/unit/test_content_scanner.py | Update test to validate preservation of a leading BOM. |

Restructure the BOM handling branch so the mid-file strip path uses an
early continue and the leading-BOM path falls through to the common
append, making the two behaviors self-documenting.

Co-authored-by: Copilot <copilot@github.com>
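
For readers without the diff at hand, the control flow described in this commit message might look roughly like the sketch below. Everything here is an assumption made for illustration: the name `strip_dangerous_sketch`, the `DANGEROUS` set, and the loop shape are not copied from the repository; only the "early continue for mid-file BOM, fall through to a shared append for leading BOM" structure comes from the message.

```python
BOM = "\ufeff"  # U+FEFF, the byte order mark

# Illustrative placeholder only: the scanner's real list of dangerous
# characters is not shown in this PR.
DANGEROUS = {"\u200b", "\u202e"}


def strip_dangerous_sketch(text: str) -> str:
    """Remove dangerous characters while keeping a leading BOM."""
    kept = []
    for i, ch in enumerate(text):
        if ch == BOM:
            if i > 0:
                continue  # mid-file BOM (warning-level): strip it
            # leading BOM (info-level): fall through to the append
        elif ch in DANGEROUS:
            continue  # any other dangerous character: strip it
        kept.append(ch)  # common append for everything preserved
    return "".join(kept)


assert strip_dangerous_sketch(BOM + "a\u200bb" + BOM + "c") == BOM + "abc"
```

Keeping a single append at the bottom of the loop means each preserved case needs no code of its own, which is what makes the strip path and the keep path easy to tell apart.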
Collaborator

@danielmeppiel left a comment

Clean, well-scoped fix. strip_dangerous() now correctly preserves leading BOM (info-level) while still stripping mid-file BOM (warning-level), aligning with the documented security model contract.

Please ensure CI tests pass before merging. Thanks for the contribution! 🎉

@dadavidtseng
Contributor Author

@danielmeppiel All CI checks have passed, but the Integration Tests (PR) required check is stuck at "Waiting for status to be reported" and was never triggered. Could you re-run it or advise on how to trigger it? Thanks!

@danielmeppiel
Collaborator

It's a workflow gated on approval - it's running now, will merge. Thanks a lot for your contribution!

@danielmeppiel merged commit 862c280 into microsoft:main on Mar 20, 2026
7 checks passed