Skip to content

Standardize Unicode encodings to explicitly use UTF-8#635

Closed
dvmukul wants to merge 5 commits intomicrosoft:mainfrom
dvmukul:fix/unicode-decoding-windows
Closed

Standardize Unicode encodings to explicitly use UTF-8#635
dvmukul wants to merge 5 commits intomicrosoft:mainfrom
dvmukul:fix/unicode-decoding-windows

Conversation

@dvmukul
Copy link
Copy Markdown

@dvmukul dvmukul commented Apr 9, 2026

Summary of Changes

This Pull Request standardizes file operations across the codebase by adding explicit encoding='utf-8' to open() calls.

Rationale

On Windows, the default system encoding is often locale-specific (e.g., CP1252 or CP950), which causes UnicodeDecodeError when reading or writing files that contain UTF-8 characters (like prompt templates or localized marketplace metadata). This fix ensures that apm behaves consistently across Windows, macOS, and Linux.

Key Fixes:

Verification

Changes were audited for text-based text vs binary-based operations to avoid data corruption. Existing unit tests were prioritized to ensure no regressions in core functionality.

This change adds encoding='utf-8' to file open operations in the script runner, configuration manager, marketplace client, and other core modules. This resolves reported issues (e.g., microsoft#604) where UnicodeDecodeError occurs on Windows when handling prompts and configuration files with non-ASCII characters.
Copy link
Copy Markdown
Collaborator

@sergio-sisternes-epam sergio-sisternes-epam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall — the changes are correct and directly fix the Windows UnicodeDecodeError from #604. Two suggestions in the inline comments (missed open() calls + test coverage).

Comment thread src/apm_cli/config.py

if not os.path.exists(CONFIG_FILE):
with open(CONFIG_FILE, "w") as f:
with open(CONFIG_FILE, "w", encoding="utf-8") as f:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Good fix. Note that an AST scan of the codebase found five more text-mode open() calls that still lack explicit encoding='utf-8':

File Line Operation
adapters/client/codex.py 65 toml.dump() — write config
adapters/client/codex.py 80 toml.load() — read config
adapters/client/copilot.py 68 json.dump() — write config
adapters/client/copilot.py 83 json.load() — read config
deps/github_downloader.py 184 write empty gitconfig

Since the PR title says "Standardize … across the codebase," it would be great to cover these too (same pattern, same Windows risk).

self.compiled_dir.mkdir(parents=True, exist_ok=True)

with open(prompt_path, "r") as f:
with open(prompt_path, "r", encoding="utf-8") as f:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core fix for #604. Consider adding a small test that round-trips a non-ASCII string (e.g. CJK characters) through PromptCompiler.compile() to lock in the fix — the existing test_cli_encoding.py only covers console stream reconfiguration, not file I/O.

@danielmeppiel danielmeppiel requested a review from Copilot April 15, 2026 20:14
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Standardizes text file I/O across apm_cli by explicitly using UTF-8 encodings when reading/writing JSON and configuration files, improving cross-platform behavior (notably Windows default encodings).

Changes:

  • Add encoding="utf-8" to open() calls that read/write JSON config and cache files.
  • Ensure plugin metadata and MCP server configuration are read consistently as UTF-8.
  • Standardize marketplace registry persistence to UTF-8 for both reads and writes.
Show a summary per file
File Description
src/apm_cli/runtime/copilot_runtime.py Reads MCP config JSON using explicit UTF-8 encoding.
src/apm_cli/models/plugin.py Reads plugin.json metadata using explicit UTF-8 encoding.
src/apm_cli/marketplace/registry.py Reads/writes marketplace registry JSON using explicit UTF-8 encoding (including temp file).
src/apm_cli/marketplace/client.py Reads/writes marketplace cache JSON/meta using explicit UTF-8 encoding.
src/apm_cli/config.py Reads/writes global config JSON using explicit UTF-8 encoding.

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 0

@danielmeppiel danielmeppiel added CI/CD Deprecated: use area/ci-cd. Kept for issue history; will be removed in milestone 0.10.0. and removed CI/CD Deprecated: use area/ci-cd. Kept for issue history; will be removed in milestone 0.10.0. labels Apr 19, 2026
@danielmeppiel
Copy link
Copy Markdown
Collaborator

@dvmukul without you accepting the CLA as per the bot comment, we cannot merge this

danielmeppiel added a commit that referenced this pull request Apr 30, 2026
…1068)

* fix: add explicit UTF-8 encoding to open() calls across 5 modules

Successor to #635 (closed-no-CLA). #604 was fixed by #607 in
script_runner.py only. This extends the same fix to the remaining
files where Windows cp950 / cp1252 locales would crash on UTF-8
content:

- src/apm_cli/config.py: global config read/write
- src/apm_cli/marketplace/client.py: marketplace cache I/O
- src/apm_cli/marketplace/registry.py: registry persistence
- src/apm_cli/models/plugin.py: plugin metadata read
- src/apm_cli/runtime/copilot_runtime.py: MCP config read

Adds round-trip unit tests with non-ASCII content for each module
to catch regressions.

Refs #604 (closed). Closes #635 (closed-no-CLA, re-derived in microsoft/apm).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test: use unicode escapes instead of raw non-ASCII literals

Addresses Copilot review on #1068. The repo's encoding rule requires
Python source files to stay within printable ASCII. Replace the CJK /
accented literals introduced for round-trip tests with \\uXXXX escapes
so runtime values remain non-ASCII (still exercising the UTF-8
encoding fix) without putting non-ASCII bytes in .py source.

Affects 5 test files; no behavioral change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: UnicodeDecodeError reading .prompt.md on Windows CP950 — open() missing encoding parameter

4 participants