Skip to content

fix(update): sanitise env before spawning installer#899

Merged
danielmeppiel merged 3 commits intomicrosoft:mainfrom
edenfunf:fix/update-subprocess-env-894
Apr 24, 2026
Merged

fix(update): sanitise env before spawning installer#899
danielmeppiel merged 3 commits intomicrosoft:mainfrom
edenfunf:fix/update-subprocess-env-894

Conversation

@edenfunf
Copy link
Copy Markdown
Contributor

Description

apm update inherits the PyInstaller bootloader's LD_LIBRARY_PATH when spawning the platform installer. The shell -- and the curl / tar / sudo calls install.sh makes -- then dlopens libssl.so.3 / libcrypto.so.3 from the bundle's _internal/ directory instead of the system ones. When the bundled libs are ABI-incompatible with what the system libcurl needs, curl aborts with OPENSSL_3.2.0 not found on the very first release fetch, blocking the upgrade path for every user on an affected distro (Debian trixie arm64 dev-containers, Fedora 43, and similar).

Fixes #894

Fix

New helper apm_cli.utils.subprocess_env.external_process_env centralises PyInstaller env sanitisation:

  • Restores LD_LIBRARY_PATH / DYLD_LIBRARY_PATH / DYLD_FRAMEWORK_PATH from the <NAME>_ORIG snapshots that PyInstaller's bootloader saves at launch.
  • Drops the variable entirely when no _ORIG snapshot exists (no pre-launch value to restore).
  • Strips the _ORIG keys from the returned env so PyInstaller internals do not leak to the child.
  • No-op outside a frozen build and on Windows.

apm update now calls subprocess.run with env=external_process_env() so the installer runs against the user's pre-launch environment. Restoring from _ORIG rather than blindly popping preserves legitimate user exports (CUDA, Nix, custom toolchains).

Complements #466's build-side exclude: that fix stopped new binaries from shipping the offending libs; this fix stops them from being inherited by spawned children even when they are present in older binaries or in any future bundle that re-introduces a similar dependency.

Type of change

  • Bug fix
  • New feature
  • Documentation
  • Maintenance / refactor

Testing

  • Tested locally
  • All existing tests pass
  • Added tests for new functionality (if applicable)

Unit coverage

  • tests/unit/test_subprocess_env.py -- 11 tests locking in the helper contract: no-op when not frozen, _ORIG restoration, drop when no _ORIG, DYLD variants, immutability of input mapping and os.environ, base-mapping precedence.
  • tests/unit/test_update_command.py -- two regression guards asserting the installer is always spawned with an explicit env= kwarg (Unix and Windows paths).

Full unit suite: 5308 passed.

End-to-end reproduction on WSL Ubuntu 22.04

  1. Placed deliberately broken libssl.so.3 / libcrypto.so.3 in a fake _internal/ dir.
  2. With LD_LIBRARY_PATH pointed at that dir, curl https://api.github.com/... failed with error while loading shared libraries: libssl.so.3: file too short -- the same dlopen-failure class as [BUG] apm update fails #894.
  3. With sys.frozen=True plus the real external_process_env() applied to the subprocess env, the same curl call returned http_code=200, while the main process still saw the polluted LD_LIBRARY_PATH -- proving the helper does not mutate the live environment.
  4. Outside a frozen build the helper is a no-op and curl still fails -- confirming dev-environment behaviour is untouched.

Note on upgrade path from pre-0.9.3 binaries

This fix takes effect from the next release onwards. Users already on an affected binary (0.8.5 or earlier in an environment whose system libcurl needs OPENSSL_3.2.0+) cannot apm update their way out, because their running binary lacks the fix. install.sh already points such users at the pip fallback (pip install --user apm-cli), which is a one-time escape. From 0.9.3+ onwards, apm update is immune to this bug class.

apm update inherits the PyInstaller bootloader's LD_LIBRARY_PATH when
spawning the platform installer. The shell -- and the curl / tar / sudo
calls install.sh makes -- then dlopens libssl.so.3 / libcrypto.so.3 from
the bundle's _internal/ directory instead of the system ones. When the
bundled libs are ABI-incompatible with what the system libcurl needs,
curl aborts with "OPENSSL_3.2.0 not found" on the very first release
fetch, blocking the upgrade path for every user on an affected distro
(Debian trixie arm64 dev-containers, Fedora 43, and similar).

Centralise PyInstaller env sanitisation in a new helper,
apm_cli.utils.subprocess_env.external_process_env, which restores
LD_LIBRARY_PATH / DYLD_LIBRARY_PATH / DYLD_FRAMEWORK_PATH from the
<NAME>_ORIG snapshots that PyInstaller's bootloader saves at launch, or
drops them entirely when no snapshot exists. The _ORIG keys are stripped
from the returned env so PyInstaller internals do not leak to the child.
Outside a frozen build and on Windows the helper is a no-op.

apm update now calls subprocess.run with env=external_process_env() so
the installer runs against the user's pre-launch environment. Restoring
from _ORIG rather than blindly popping preserves legitimate user
exports (CUDA, Nix, custom toolchains).

Complements microsoft#466's build-side exclude: that fix stopped new binaries
from shipping the offending libs; this fix stops them from being
inherited by spawned children even when they are present in older
binaries or in any future bundle that re-introduces a similar
dependency.

Closes microsoft#894
@danielmeppiel danielmeppiel added the panel-review Trigger the apm-review-panel gh-aw workflow label Apr 24, 2026
@github-actions
Copy link
Copy Markdown

APM Review Panel Verdict

Disposition: APPROVE (with one recommended follow-up tracked below)


Per-persona findings

Python Architect:

This is a routine PR -- a new utility module in src/apm_cli/utils/ backed by a targeted call-site fix in commands/update.py. The change is entirely procedural (no new classes).

1. OO / class diagram

classDiagram
    direction LR
    class subprocess_env {
        <<Pure>>
        +_PYINSTALLER_MANAGED_LIBRARY_VARS tuple
        +external_process_env(base) dict
    }
    class update {
        <<IOBoundary>>
        +update() command
        +_get_installer_run_command() list
    }
    class os_environ {
        <<External>>
    }
    class sys_frozen {
        <<External>>
    }
    subprocess_env ..> os_environ : reads copy
    subprocess_env ..> sys_frozen : checks getattr
    update ..> subprocess_env : uses external_process_env()
    class subprocess_env:::touched
    class update:::touched
    classDef touched fill:#fff3b0,stroke:#d47600
Loading

2. Execution flow diagram

flowchart TD
    A["user: apm update"] --> B["update.py: update()"]
    B --> C["[NET] download installer via requests.get"]
    C --> D["[FS] write to NamedTemporaryFile"]
    D --> E["external_process_env() -- subprocess_env.py"]
    E --> F{"getattr(sys, 'frozen', False)"}
    F -- "False (dev / source install)" --> G["return dict(os.environ) -- no-op"]
    F -- "True (PyInstaller binary)" --> H["iterate _PYINSTALLER_MANAGED_LIBRARY_VARS"]
    H --> I{"key_ORIG in env?"}
    I -- "Yes -- user had prior export" --> J["env[key] = env[key_ORIG]; pop _ORIG"]
    I -- "No -- PyInstaller injected it" --> K["env.pop(key, None)"]
    J --> L["[EXEC] subprocess.run(cmd, env=sanitised_env, check=False)"]
    K --> L
    G --> L
    L --> M["[EXEC] install.sh spawns system curl / tar / sudo"]
    M --> N["system binaries resolve libs from system paths"]
Loading

Design patterns

  • Used in this PR: none -- straight-line procedural code with a guard clause, appropriate for a cross-cutting utility helper at this scope.
  • Pragmatic suggestion: none -- the current shape is the simplest correct design at this scope. A class would add indirection with no benefit.

Quality observations:

  1. Placement in utils/ is correct -- consistent with other cross-cutting helpers.
  2. base: Mapping[str, str] | None makes the function testable without mocking os.environ at module level -- good API design.
  3. getattr(sys, "frozen", False) is the idiomatic PyInstaller detection idiom.
  4. Stripping _ORIG keys from the returned env avoids leaking PyInstaller internals to child processes.
  5. The _PYINSTALLER_MANAGED_LIBRARY_VARS tuple is intentionally narrow and correctly documented as such.

CLI Logging Expert: No output path changes anywhere in the diff. The fix is entirely below the logging layer -- subprocess.run() gets a different env= dict, but the logger.progress("Running installer...", symbol="gear") line is preserved and no new _rich_* calls or CommandLogger paths are introduced. No anti-patterns detected. No action required.


DevX UX Expert: No CLI surface changes. apm update has exactly the same flags, help text, exit codes, and output as before. The fix silently resolves a failure mode that previously printed a cryptic libssl.so.3: version 'OPENSSL_3.2.0' not found error from deep inside the installer script -- a scenario where the user had no actionable next step. From a UX standpoint this is a pure improvement: a previously broken command now works. The CHANGELOG entry is written for users, not maintainers. No action required.


Supply Chain Security Expert: The fix is security-positive: it narrows the child process's effective environment by removing PyInstaller's bundled library-path overrides. No auth variables (GH_TOKEN, GITHUB_TOKEN, GITHUB_APM_PAT, etc.) are touched -- the helper is scoped exclusively to LD_LIBRARY_PATH, DYLD_LIBRARY_PATH, and DYLD_FRAMEWORK_PATH. The _ORIG restoration reads from values the PyInstaller bootloader sets at process start (before any user input), so there is no injection surface.

Flag (non-blocking, follow-up): The module docstring correctly states this helper is "the single source of truth for child-process environment sanitisation", but the PR applies it to exactly one call site (update.py). A grep of the codebase reveals approximately 15 other subprocess.run() / Popen call sites that do NOT use external_process_env():

  • src/apm_cli/runtime/manager.py (3 sites) -- npm/node runtime spawning
  • src/apm_cli/runtime/copilot_runtime.py, codex_runtime.py, llm_runtime.py -- MCP server and LLM agent spawning
  • src/apm_cli/core/azure_cli.py (2 sites) -- az account get-access-token subprocess
  • src/apm_cli/core/token_manager.py -- git credential fill subprocess
  • src/apm_cli/deps/github_downloader.py -- git operations
  • src/apm_cli/commands/_helpers.py (2 sites)

The PR's docstring explicitly references issue #462 (the git case on Fedora 43) as a prior regression from the same root cause. Those other sites may share the same vulnerability. This should be tracked as a dedicated follow-up issue, not a blocker for this PR.


Auth Expert: Not activated -- the PR touches only src/apm_cli/utils/subprocess_env.py (new) and src/apm_cli/commands/update.py (env arg addition); no fast-path auth files are modified, and the only environment variables the helper acts on are LD_LIBRARY_PATH, DYLD_LIBRARY_PATH, and DYLD_FRAMEWORK_PATH -- none of which are auth credentials or token sources.


OSS Growth Hacker: This fix removes a trust-destroying failure mode on two fast-growing developer segments: arm64 dev-containers (Debian trixie, GitHub Codespaces arm64) and Fedora 43. Developers setting up AI-native tooling in containers are exactly the users APM needs to convert to repeat users -- and a broken apm update with a cryptic linker error is a silent churn trigger.

Side-channel to CEO: The CHANGELOG entry is already well-written for a release note beat. Worth surfacing in the next release post as a reliability signal: "APM update now works on modern Linux distros including arm64 dev-containers." That is a concrete, repostable claim with a platform callout. No growth-strategy.md update required for a point fix of this scope.


CEO arbitration

All five mandatory specialists are in alignment: this is a correct, well-scoped, well-tested fix that should ship. The implementation follows PyInstaller's official _ORIG restoration protocol, the helper is placed in the right module layer, and the 10-test suite locks in the contract against future regressions. The Supply Chain Security flag about other subprocess call sites is real but is a follow-up item, not a reason to hold this PR -- the PR correctly solves the reported issue (#894) and establishes the canonical pattern that those other sites can adopt. I ratify APPROVE. The one concrete action before merge is to open a follow-up issue for the remaining subprocess call sites; the CHANGELOG entry and test coverage for this PR are already in good shape.


Required actions before merge

  1. None required for this PR. However, open a follow-up issue titled "Apply external_process_env() to remaining subprocess call sites ([BUG] Fedora: brew apm fails on Git clone due to bundled OpenSSL mismatch #462 and runtime spawning)" referencing runtime/manager.py, core/azure_cli.py, core/token_manager.py, deps/github_downloader.py, and commands/_helpers.py as the sites to audit. The docstring already calls this the "single source of truth" -- the follow-up ensures that claim becomes true.

Optional follow-ups

  • Track the remaining ~15 subprocess.run()/Popen call sites in a dedicated issue. The core/azure_cli.py case (spawning az account get-access-token) is particularly worth auditing since it runs on Linux and could hit the same OpenSSL ABI mismatch.
  • Consider whether DYLD_INSERT_LIBRARIES (macOS) should be added to _PYINSTALLER_MANAGED_LIBRARY_VARS if future PyInstaller versions manage it. Not a current concern, but the constant is easy to extend.
  • The runtime spawning sites (copilot_runtime.py, codex_runtime.py, llm_runtime.py) run long-lived child processes (MCP servers). Sanitising their env via external_process_env() would also prevent the library-path leak from affecting any native extensions those servers load.

Generated by PR Review Panel for issue #899 · ● 1M ·

@danielmeppiel danielmeppiel enabled auto-merge April 24, 2026 21:52
@danielmeppiel danielmeppiel added this pull request to the merge queue Apr 24, 2026
@danielmeppiel danielmeppiel removed this pull request from the merge queue due to a manual request Apr 24, 2026
@danielmeppiel danielmeppiel merged commit 4e3c2ec into microsoft:main Apr 24, 2026
7 checks passed
danielmeppiel added a commit that referenced this pull request Apr 24, 2026
…#921)

* fix(ci): add merge_group trigger to Merge Gate so it reports in queue

Branch protection / merge-queue ruleset requires the 'gate' check on
both PR-time and merge-queue contexts, but the gate workflow only
fired on 'pull_request'. In the merge queue, GitHub fires 'merge_group'
events against a temp merge commit -- the gate check was never created
on that SHA, so PRs sat in the queue with 'gate' stuck in
'Expected -- Waiting for status to be reported' indefinitely
(observed on PR #899).

Changes
-------
.github/workflows/merge-gate.yml
- Add 'merge_group' (types: checks_requested) and keep existing
  'pull_request' + 'workflow_dispatch' triggers.
- Resolve head SHA per event:
    workflow_dispatch -> gh api .../pulls/N --jq .head.sha
    merge_group       -> github.event.merge_group.head_sha
    pull_request      -> github.event.pull_request.head.sha
- Branch EXPECTED_CHECKS by event:
    pull_request / workflow_dispatch: 'Build & Test (Linux),APM Self-Check'
    merge_group: + 'Build (Linux),Smoke Test (Linux),
                   Integration Tests (Linux),Release Validation (Linux)'
  (the merge_group-only checks emitted by ci-integration.yml plus the
  ci.yml checks that also run on merge_group)
- Bump TIMEOUT_MIN 30 -> 55 and job timeout-minutes 35 -> 60 to absorb
  ci-integration.yml's theoretical worst-case critical path (Build ->
  Smoke -> Integration[20m] -> Release Validation[20m]).
- Update header comment + recovery instructions to cover both contexts.

.github/scripts/ci/merge_gate_wait.sh
- Accept new optional EVENT_NAME env var; emit event-aware recovery
  instructions on exit code 2 (in merge_group context, pushing a commit
  does NOT retrigger the merge_group event -- the user must re-queue).
- Add '&filter=latest' to the Checks API query so GitHub returns only
  the latest run per name, removing reliance on client-side sort and
  pagination order.

Concurrency
-----------
The existing key 'merge-gate-${{ pull_request.number || inputs.pr_number
|| github.ref }}' falls through to github.ref in merge_group context.
github.ref there is 'refs/heads/gh-readonly-queue/main/pr-N-<sha>',
unique per queue entry, so cancel-in-progress dedupes correctly within
a single temp branch and never collides across PR/merge_group channels.

Self-deadlock
-------------
'gate' is intentionally absent from EXPECTED_CHECKS in both contexts.

Audit
-----
Design audited against live GitHub docs:
- docs.github.com/.../webhook-events-and-payloads#merge_group
- docs.github.com/.../managing-a-merge-queue
- docs.github.com/en/rest/checks/runs
Verdict: ship with the event-aware recovery message included here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(ci): drop paths-ignore from gate + ci so docs-only PRs satisfy gate

Both .github/workflows/merge-gate.yml and .github/workflows/ci.yml
carried identical paths-ignore (docs/**, .gitignore, LICENSE). For a
docs-only PR neither workflow fires, so the 'gate' check-run is never
created -- if the PR ruleset requires 'gate', branch protection
displays it as 'Expected -- Waiting' forever and the PR cannot merge.

Removing paths-ignore from BOTH (not just one) is required: dropping
it only from merge-gate.yml would leave the gate polling for ci.yml
checks that never appear, timing out at TIMEOUT_MIN with exit 2 (false
failure). Removing from both means ci.yml runs on docs-only PRs (~5
min of free GitHub-hosted runner time) and the gate aggregates as
normal -- coherent regardless of which ruleset tier requires gate.

Caught in code review on PR #921. Same observation was flagged but
left out-of-scope in the original PR description; folding in now.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

panel-review Trigger the apm-review-panel gh-aw workflow

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] apm update fails

2 participants