Skip to content

fix: extend explicit UTF-8 encoding to 5 remaining open() call sites#1068

Merged
danielmeppiel merged 2 commits intomainfrom
fix/utf8-encoding-non-script-files
Apr 30, 2026
Merged

fix: extend explicit UTF-8 encoding to 5 remaining open() call sites#1068
danielmeppiel merged 2 commits intomainfrom
fix/utf8-encoding-non-script-files

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

Why

Windows users on cp950 / cp1252 locales hit UnicodeDecodeError on any apm operation that reads or writes UTF-8 content through the 5 modules listed below. Issue #604 covered the prompt-file path and was fixed by #607 (script_runner.py only). The original PR #635 covered the broader surface but was closed because the author did not sign CLA. Re-deriving here.

What

Adds encoding="utf-8" to the remaining open() call sites in:

  • src/apm_cli/config.py (5 sites)
  • src/apm_cli/marketplace/client.py (5 sites)
  • src/apm_cli/marketplace/registry.py (3 sites)
  • src/apm_cli/models/plugin.py (1 site)
  • src/apm_cli/runtime/copilot_runtime.py (1 site)

Adds round-trip unit tests per module that exercise non-ASCII content (e.g. 中文, 開始, cafe). Skipped binary-mode opens (encoding irrelevant) and subprocess.Popen (different code path; already passes encoding="utf-8").

Refs

Validation

  • uv run ruff check src/ tests/: clean.
  • uv run pytest tests/unit tests/test_console.py: 6870 passed.
  • 9 new round-trip tests added; manually verified they would fail on a cp1252 locale without the encoding additions (the default codec cannot decode UTF-8 multi-byte sequences such as e4 b8 ad for ).

Successor to #635 (closed-no-CLA). #604 was fixed by #607 in
script_runner.py only. This extends the same fix to the remaining
files where Windows cp950 / cp1252 locales would crash on UTF-8
content:

- src/apm_cli/config.py: global config read/write
- src/apm_cli/marketplace/client.py: marketplace cache I/O
- src/apm_cli/marketplace/registry.py: registry persistence
- src/apm_cli/models/plugin.py: plugin metadata read
- src/apm_cli/runtime/copilot_runtime.py: MCP config read

Adds round-trip unit tests with non-ASCII content for each module
to catch regressions.

Refs #604 (closed). Closes #635 (closed-no-CLA, re-derived in microsoft/apm).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 30, 2026 12:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR standardizes text-mode file I/O to explicitly use UTF-8 across the remaining open() call sites that could trigger UnicodeDecodeError on Windows non-UTF-8 locales, and adds unit tests intended to guard the behavior.

Changes:

  • Add encoding="utf-8" to remaining text open() call sites in config, marketplace cache/registry, plugin metadata loading, and Copilot MCP config loading.
  • Add new/extended unit tests to exercise round-trips involving non-ASCII content for the affected modules.
Show a summary per file
File Description
src/apm_cli/config.py Add explicit UTF-8 encoding to config JSON reads/writes.
src/apm_cli/marketplace/client.py Add explicit UTF-8 encoding to marketplace cache reads/writes.
src/apm_cli/marketplace/registry.py Add explicit UTF-8 encoding to marketplace registry persistence.
src/apm_cli/models/plugin.py Add explicit UTF-8 encoding when reading plugin.json.
src/apm_cli/runtime/copilot_runtime.py Add explicit UTF-8 encoding when reading MCP config JSON.
tests/unit/test_config.py New tests intended to validate UTF-8 config I/O behavior.
tests/unit/marketplace/test_marketplace_client.py New tests intended to validate UTF-8 cache I/O behavior.
tests/unit/marketplace/test_marketplace_registry.py New tests intended to validate UTF-8 registry I/O behavior.
tests/unit/test_plugin.py New test intended to validate UTF-8 plugin metadata reads.
tests/unit/test_copilot_runtime.py New test intended to validate UTF-8 MCP config reads.

Copilot's findings

Comments suppressed due to low confidence (6)

tests/unit/test_config.py:45

  • These tests go through update_config(), which writes JSON via json.dump with the default ensure_ascii=True. That will escape non-ASCII as \uXXXX, leaving the file ASCII-only and not exercising the UTF-8 decode path that was failing on Windows locales. To cover the regression, add a case that writes a raw UTF-8 config.json (e.g. ensure_ascii=False) and asserts get_config() can load it.
    def test_config_file_is_utf8_on_disk(self, isolated_config):
        non_ascii_value = "# 開始 -- cafe"
        config_mod.update_config({"note": non_ascii_value})

        # Read raw bytes and decode as UTF-8 to assert the on-disk encoding.
        raw = isolated_config.read_bytes()
        decoded = json.loads(raw.decode("utf-8"))

tests/unit/marketplace/test_marketplace_registry.py:110

  • This test path uses registry_mod.add_marketplace(), which persists via json.dump(...) with default ensure_ascii=True (non-ASCII becomes \uXXXX). That keeps the registry file ASCII-only, so it wouldn't have triggered the Windows decode error prior to adding encoding="utf-8". To make the regression test meaningful, load from a registry file written with raw UTF-8 (e.g. ensure_ascii=False).
        src = MarketplaceSource(name="cafe-mkt", owner="cafe-org", repo="plugins-開始")
        registry_mod.add_marketplace(src)

        # Force re-load from disk by clearing the cache.
        registry_mod._invalidate_cache()
        fetched = registry_mod.get_marketplace_by_name("cafe-mkt")

tests/unit/marketplace/test_marketplace_registry.py:126

  • This test writes raw UTF-8 JSON but includes non-ASCII literals directly in the Python source (e.g. repo name with CJK). The repo requires Python source files to remain printable ASCII; please convert these literals to ASCII-only escape sequences ("\u....") so the runtime JSON still contains non-ASCII bytes without violating the source encoding rule.
        payload = {
            "marketplaces": [
                {"name": "cafe-mkt", "owner": "o", "repo": "repo-中文"},
            ]

tests/unit/marketplace/test_marketplace_client.py:446

  • These tests call _write_cache(), but that function uses json.dump with default ensure_ascii=True, so non-ASCII becomes \uXXXX and the cache files remain ASCII-only. That means the tests wouldn't have failed before the encoding="utf-8" change and don't actually exercise the regression scenario. Consider writing the cache JSON with raw UTF-8 (e.g. ensure_ascii=False) and then asserting _read_cache() / _read_stale_cache() can load it.
        }
        client_mod._write_cache("utf8-mkt", data)

        cached = client_mod._read_cache("utf8-mkt")
        assert cached is not None

tests/unit/test_copilot_runtime.py:175

  • _json.dumps(servers) defaults to ensure_ascii=True, which escapes non-ASCII as \uXXXX. That means the bytes written here will be ASCII-only and won't reproduce the Windows default-encoding decode failure this PR is guarding against. Write raw UTF-8 JSON (e.g. ensure_ascii=False) so the on-disk content actually contains non-ASCII bytes.
                    "description": "中文 description -- cafe",
                }
            }
        }
        mcp_path.write_bytes(_json.dumps(servers).encode("utf-8"))

tests/unit/test_plugin.py:25

  • json.dumps(metadata) defaults to ensure_ascii=True, which will escape non-ASCII characters as \uXXXX and keep plugin.json ASCII-only. That means this test would likely pass even if Plugin.from_path() used a non-UTF-8 platform default encoding. To validate the encoding fix, write JSON with raw UTF-8 bytes (e.g. ensure_ascii=False) and then read via Plugin.from_path().
        plugin_json = tmp_path / "plugin.json"
        plugin_json.write_bytes(json.dumps(metadata).encode("utf-8"))

  • Files reviewed: 10/10 changed files
  • Comments generated: 5

Comment on lines +169 to +172
"command": "node",
"args": ["server.js"],
"description": "中文 description -- cafe",
}
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test introduces non-ASCII string literals in the Python source (e.g. CJK characters in the JSON data). The repo requires Python source files to stay within printable ASCII; please use ASCII-only escape sequences ("\u....") or construct UTF-8 bytes via escapes so runtime values remain non-ASCII without non-ASCII source text.

This issue also appears on line 171 of the same file.

Copilot generated this review using guidance from repository custom instructions.
Comment thread tests/unit/test_config.py
Comment on lines +29 to +32
def test_update_config_preserves_non_ascii(self, isolated_config):
non_ascii_value = "/Users/cafe/projets/開始"
config_mod.update_config({"copilot_cowork_skills_dir": non_ascii_value})

Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-ASCII literals are introduced in the test data (e.g. in the path/config values). The repo requires Python source files to be printable ASCII, so please express these values using ASCII-only escape sequences ("\u....") while still producing non-ASCII runtime strings.

This issue also appears on line 39 of the same file.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +104 to +106
# description-bearing source by writing a custom entry directly.
src = MarketplaceSource(name="cafe-mkt", owner="cafe-org", repo="plugins-開始")
registry_mod.add_marketplace(src)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test adds non-ASCII string literals in the Python source (e.g. in the repo value). The repo requires Python source files to remain printable ASCII; please replace these with ASCII-only escape sequences so runtime values are still non-ASCII without non-ASCII source text.

This issue also appears in the following locations of the same file:

  • line 105
  • line 123

Copilot generated this review using guidance from repository custom instructions.
Comment thread tests/unit/test_plugin.py
Comment on lines +18 to +21
"name": "Demo plugin -- cafe",
"version": "1.0.0",
"description": "Plugin de demo with 中文 description",
"author": "Cafe Author",
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file includes non-ASCII string literals (CJK characters) in source. The repo requires Python source files to remain printable ASCII; please replace these literals with ASCII-only escape sequences so the runtime value is still non-ASCII without introducing non-ASCII characters into the .py file.

This issue also appears on line 23 of the same file.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +438 to +442
data = {
"name": "Marketplace -- cafe",
"description": "中文 description",
"plugins": [{"name": "skill-開始", "author": "cafe"}],
}
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new tests include non-ASCII literals in the Python source (e.g. in description / plugin names). The repo requires Python source files to remain printable ASCII; please use ASCII-only escape sequences ("\u....") or construct UTF-8 bytes via escapes so runtime values remain non-ASCII without non-ASCII source text.

This issue also appears on line 442 of the same file.

Copilot generated this review using guidance from repository custom instructions.
Addresses Copilot review on #1068. The repo's encoding rule requires
Python source files to stay within printable ASCII. Replace the CJK /
accented literals introduced for round-trip tests with \\uXXXX escapes
so runtime values remain non-ASCII (still exercising the UTF-8
encoding fix) without putting non-ASCII bytes in .py source.

Affects 5 test files; no behavioral change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Addressed Copilot review: replaced raw CJK / accented literals with \uXXXX escapes across all 5 test files. Same runtime values (still non-ASCII), ASCII-only source as the repo encoding rule requires. 57 affected tests still pass.

@danielmeppiel danielmeppiel merged commit cd9426f into main Apr 30, 2026
10 checks passed
@danielmeppiel danielmeppiel deleted the fix/utf8-encoding-non-script-files branch April 30, 2026 15:19
danielmeppiel pushed a commit to tillig/apm that referenced this pull request Apr 30, 2026
Bring in PR microsoft#1068 (utf-8 encoding) and other main updates so the PR
diff shows only the focused fix. Earlier panel rejection cited a
hook_integrator regression which is no longer present in this branch
(removed via prior merges); the core claude_formatter.py fix that the
panel approved remains intact and is the only logical change.

Co-authored-by: Travis Illig <tillig@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants