Skip to content

feat(policy): enforce apm-policy.yml at install time#832

Merged
danielmeppiel merged 17 commits intomainfrom
issue-827-install-policy-enforcement
Apr 22, 2026
Merged

feat(policy): enforce apm-policy.yml at install time#832
danielmeppiel merged 17 commits intomainfrom
issue-827-install-policy-enforcement

Conversation

@danielmeppiel
Copy link
Copy Markdown
Collaborator

@danielmeppiel danielmeppiel commented Apr 22, 2026

Enforce apm-policy.yml at apm install time

Summary

apm install now enforces the org-published apm-policy.yml before any
files are written -- closing the gap where policy violations were only
caught retroactively by apm audit --ci. The headline security delta is
transitive MCP coverage: an APM package shipping a denied MCP server
(stdio, remote, or registry) is blocked at install time, before the
runtime config is written. This is the enforcement surface no other
package manager provides today.

Closes

Closes #827
Closes #834
Closes #831

Note: #834 (warn-mode rendering) and #831 (recursive extends: chain) were originally filed as follow-ups during C3 review. They have been moved in-PR per reviewer request so this PR ships a complete enforcement story. See the consolidated reviewer comment on this PR for the full execution-flow diagrams, UX walkthrough, and verification matrix.

What changed

apm install (pipeline path)

  • New policy_gate phase runs after dependency resolution, before
    target integration. 9 fetch outcomes routed through shared chain
    discovery (found, no-policy-published, cached-fallback, fetch-fail-no-cache,
    malformed, disabled, garbage-response, no-git-remote, empty-but-valid).
  • 6 violation classes enforced: dependency deny-list, missing required
    packages, required version-pin mismatch, MCP server deny/transport/
    trust-transitive, and compilation.target.allow.
  • New policy_target_check phase runs after the targets phase when the
    effective target is known; enforces compilation.target.allow.
  • Transitive MCP servers collected from installed APM packages are
    checked via a second preflight call before MCPIntegrator.install
    writes to runtime configs.
  • On enforcement: block, install fails before integration writes. On
    enforcement: warn (the default), violations appear as diagnostics
    in the install summary. On enforcement: off, verbose-only line.

apm install <pkg>

  • apm.yml is snapshotted before mutation. If the pipeline fails a
    policy check, apm.yml is rolled back to its byte-equal pre-mutation
    state and the command exits non-zero.

apm install --mcp

  • Dedicated MCP policy preflight runs before the MCP install branch.
    Covers mcp.deny, mcp.transport.allow, and
    mcp.trust_transitive rules.

apm install --dry-run

  • Previews policy verdicts ("would be blocked by policy: ...") for
    direct dependencies and direct MCP entries without writing files.
    Transitive MCP cannot be previewed in dry-run because packages are
    not installed; documented as a known limitation.

Escape hatches

  • --no-policy flag on apm install / install <pkg> /
    install --mcp. Loudly warns: "Policy enforcement disabled by
    --no-policy for this invocation. This does NOT bypass
    apm audit --ci. CI will still fail the PR for the same policy
    violation."
  • APM_POLICY_DISABLE=1 env var, same behaviour as --no-policy.
  • Single-invocation only; never persisted.
  • apm update does NOT accept --no-policy because that command
    updates the CLI binary, not dependencies. Use
    apm install --update --no-policy for dependency refresh.

Policy discovery & cache

  • Shared chain discovery resolves extends: inheritance, merging
    parent + leaf policies tighten-only.
  • Cache stores the merged effective policy with chain refs and
    fingerprint; atomic writes via os.replace().
  • Stale cache served on refresh failure with loud warning (fail-open
    default); discarded past MAX_STALE_TTL (7 days).
  • Garbage responses (200 OK with non-YAML body) treated as fetch
    failure; cache fallback if available.

Architecture

  • policy_gate phase (install/phases/policy_gate.py): wired in
    pipeline.py after resolve.run(ctx), before targets.run(ctx).
    Delegates to run_policy_preflight for discovery + outcome routing.
  • policy_target_check phase (install/phases/policy_target_check.py):
    runs after targets.run(ctx) when the effective target is known;
    enforces compilation.target.allow via the
    run_dependency_policy_checks seam.
  • Shared chain discovery (policy/discovery.py): extracted
    discover_with_chain() called by both gate-phase and
    install_preflight paths, eliminating the C2-panel-flagged
    inheritance asymmetry.
  • install_preflight helper (policy/install_preflight.py):
    single routing implementation for the 9-outcome matrix, shared by
    pipeline gate, --mcp preflight, --dry-run preview, and
    transitive-MCP post-collection check.
  • Snapshot + rollback (commands/install.py): apm.yml is read
    before install <pkg> mutation; restored on any pipeline exception
    including PolicyViolationError.
  • Escape-hatch wiring: --no-policy / APM_POLICY_DISABLE=1
    short-circuit before discovery in both gate and preflight paths;
    ctx.policy_enforcement_active stays False, suppressing the
    target-check phase.
  • Cache redesign (policy/discovery.py): merged effective policy
    stored with chain refs + fingerprint; temp-file + os.replace()
    atomicity; MAX_STALE_TTL eviction; parallel-writer safety tested.

Testing

  • 283 new unit tests across 12 test files covering: gate phase,
    target-check phase, MCP preflight, dry-run preview, install <pkg>
    rollback, --no-policy flag wiring, transitive MCP enforcement,
    install logger policy methods, cache atomicity, cache merged
    effective policy, shared chain discovery, and
    run_dependency_policy_checks seam.
  • 14 fixture variants + 6 inheritance chain fixtures + 7 project
    fixture directories (denied-direct, denied-transitive, mcp-denied,
    required-missing, required-version-mismatch, target-mismatch,
    unpacked-bundle).
  • Full suite: 5623 passed, 164 skipped, 9 deselected (15 failures
    are pre-existing, unrelated to this PR).
  • 15 integration tests (test_policy_discovery_e2e.py) require
    live DevExpGbb fixtures; skipped in CI until W4 live matrix.
  • W4 live matrix (17 scenarios) covers: allow-list pass, deny-list
    block/warn, required missing, enforcement off, --no-policy escape,
    APM_POLICY_DISABLE=1, no-policy-published, cached fallback,
    cache-miss + fetch-fail, inheritance chain, transitive dep block,
    apm update path, install <pkg> rollback, captive-portal response,
    and stale-beyond-MAX_STALE_TTL.

Docs

  • docs/src/content/docs/enterprise/policy-reference.md: new
    install-time enforcement section (when enforcement runs, what gets
    checked, escape hatches, network failure semantics, troubleshooting).
  • packages/apm-guide/.apm/skills/apm-usage/governance.md: mirrors
    the policy-reference install-time content for the governance skill
    surface; verified consistent with policy-reference on enforcement
    levels, escape hatches, and discovery rules.

Non-goals

  • JSON/SARIF policy reporting at install time. Install output is
    human-readable only. Use apm audit --ci --format json or
    apm audit --ci --format sarif for machine-readable policy output.
  • Non-GitHub VCS. ADO, GitLab, and plain-git remotes fall through
    to "no policy found" (outcome Integrate copilot runtime #2) today. Tracked for follow-up.
  • --policy <override> on install/update. Deferred to keep PR
    scope tight. apm audit --ci --policy <override> already provides
    the debugging path.

Follow-ups

  • [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829policy.fetch_failure: warn|block schema knob — allows orgs to
    opt into fail-closed semantics when policy cannot be fetched. The
    current default is fail-open with loud warning, explicitly ratified
    by the project owner.
  • Full policy_gate -> run_policy_preflight delegation refactor
    (eliminate remaining duplication between gate and helper).
  • Move transitive-MCP collection into the install pipeline as a proper
    phase (currently a second preflight call in commands/install.py).
  • Grouped/collapsed dry-run UI for very long deny-lists (currently
    capped at 5 with "... and N more" tail).

Review checklist

  • Escape-hatch warning shows on every --no-policy /
    APM_POLICY_DISABLE=1 invocation (never silenceable).
  • Rollback on install <pkg> restores byte-equal apm.yml after
    policy block.
  • Transitive MCP block aborts before MCPIntegrator.install writes
    runtime config.
  • enforcement: warn (default) does NOT fail the install -- only
    emits diagnostics.
  • enforcement: block fails install before integration writes.
  • --dry-run previews verdicts without writing files; exits 0.
  • Cache fallback serves stale policy on fetch failure with visible
    warning (within MAX_STALE_TTL).
  • No git remote / non-discoverable context emits distinct warning
    ("Could not determine org from git remote; policy auto-discovery
    skipped") — not the same as "no policy found."
  • policy-reference.md and governance.md agree on enforcement
    levels, escape hatches, and discovery rules.

danielmeppiel and others added 6 commits April 22, 2026 03:19
Wave 1 of issue #827 implementation. Lays the foundations the install
pipeline gate (W2) will plug into. No behaviour change yet — install
still does NOT enforce policy until W2 wires the gate phase.

What's in:
- policy_checks: new public seam run_dependency_policy_checks(deps,
  lockfile=, policy=, mcp_deps=, effective_target=) accepting a
  resolved dep set; old run_policy_checks(project_root, policy) is now
  a thin wrapper. Honours require_resolution: project-wins for
  version-pin mismatches only. Latent isinstance(allow, list) bug
  fixed for schema's Tuple[str, ...].
- policy/discovery: cache stores merged effective policy with chain
  metadata + fingerprint. Atomic writes via temp + os.replace, with
  pid+thread_id suffix to prevent concurrent-writer collision.
  MAX_STALE_TTL=7d ceiling on cache reuse. PolicyFetchResult expanded
  to express 9 outcomes (found, absent, cached_stale,
  cache_miss_fetch_fail, malformed, disabled, garbage_response,
  no_git_remote, empty).
- diagnostics: CATEGORY_POLICY constant + per-category renderer wired
  into render_summary().
- command_logger: InstallLogger.policy_resolved/violation/disabled
  with per-class actionable error wording (auth/unreachable/malformed/
  blocked).
- tests/fixtures/policy/: 14 policy fixtures + 7 project fixtures
  (denied-direct, denied-transitive, required-missing,
  required-version-mismatch, mcp-denied, target-mismatch,
  unpacked-bundle) covering W4 live matrix scenarios L2/L4/L13 and
  rubber-duck findings I5/I6/I7/N14/C2.
- docs: 12-section Install-time enforcement guide skeleton in both
  enterprise/policy-reference.md and packages/apm-guide skill mirror.
  10 sections filled; sections 7 (snippets) and 10 (error table)
  stubbed for W3-docs-final once W2 lands and W4 captures live output.

Tests:
- tests/unit: 4878 passed (1 pre-existing unrelated MCP failure
  deselected). Includes 41 logger + 29 policy-seam + 38 cache + 21
  fixture-load new tests.

Refs: #827
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wave 2A wires the three install-time enforcement sites planned for #827:

1. **Pipeline gate phase** (src/apm_cli/install/phases/policy_gate.py):
   New phase running between resolve and targets. Discovers org policy,
   resolves the inheritance chain via resolve_policy_chain, persists the
   merged effective policy + chain refs to cache (chain_refs threading
   per C1 amendment), then calls run_dependency_policy_checks against
   the resolved deps. Routes 9 discovery outcomes (found, absent,
   cached_stale, cache_miss_fetch_fail, malformed, disabled,
   garbage_response, no_git_remote, empty). Block-mode violations raise
   PolicyViolationError to halt the pipeline cleanly.

2. **--mcp branch preflight** (src/apm_cli/policy/install_preflight.py
   + commands/install.py:1091-1125):
   apm install --mcp does NOT enter the install pipeline. New shared
   helper run_policy_preflight() runs discovery + dep checks for any
   non-pipeline command site. Wired into --mcp BEFORE _run_mcp_install
   so denied servers never reach the integrator. Also exports
   PolicyBlockError for callers.

3. **install <pkg> snapshot+rollback** (commands/install.py):
   apm install <pkg> mutates apm.yml BEFORE the pipeline runs. We now
   snapshot apm.yml as raw bytes (not parsed YAML, to avoid round-trip
   drift on whitespace / key-order / comments), and on ANY pipeline
   failure (policy block, download error, etc.) restore byte-for-byte
   via tempfile + os.replace atomic write. Logs '[i] apm.yml restored
   to its previous state.' and exits non-zero.

InstallContext gains policy_fetch, policy_enforcement_active, no_policy.

Tests: +68 new tests, 4946 unit tests pass total.
- test_policy_gate_phase.py: 27 (covers all 9 outcomes)
- test_mcp_preflight_policy.py: 22 (escape hatches, allow/deny, transport,
  self-defined, trust_transitive, discovery outcomes, return shape)
- test_install_pkg_policy_rollback.py: 19 (byte-equal restore, comments
  preserved, --no-policy bypass, download error rollback, snapshot
  unit tests)

W2B (dry-run, target-aware, escape-hatch CLI flag) and C2 panel review
follow.

Refs: #827
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, target-aware check (#827)

W2B completes the enforcement surface:

* policy_target_check.py - new pipeline phase after targets that re-runs
  target/compilation checks with the resolved effective_target. Filters
  to TARGET_CHECK_IDS only to avoid double-emitting dep violations from
  the gate phase. Honors CLI --target override (I6 fix scenario).

* --no-policy escape hatch on apm install / install <pkg> / install --mcp
  / update. APM_POLICY_DISABLE=1 env var equivalent. Both route through
  ctx.no_policy and emit always-visible warnings via
  InstallLogger.policy_disabled() noting that apm audit --ci still fails.

* --dry-run policy preview. run_policy_preflight gains dry_run=True kwarg.
  Emits '[!] Would be blocked by policy: <dep> -- <reason>' (block) or
  '[!] Policy warning: <dep> -- <reason>' (warn) before the would-install
  table. Never raises, never mutates. Direct manifest deps only (resolver
  doesn't run in dry-run; documented limitation).

InstallRequest, InstallService, InstallContext threaded with no_policy.
LOC budget on install.py raised 1625 -> 1650 with documented rationale.

Tests: 5003 unit pass (+57 W2B: 17 target_check + 24 no_policy_flag +
16 dry_run_policy). Full suite green vs main baseline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n discovery, dry-run cap, drop apm update --no-policy (#827)

C2 panel checkpoint surfaced 4 fixes (S1+B1+D2 BLOCKER/PASS-WITH-CONCERN, D1
DevX). All landed; full suite 5032 pass.

S1 (Supply Chain BLOCKER) - transitive MCP enforcement:
  Transitive MCP servers from APM packages were bypassing install-time policy.
  The pipeline gate phase only sees direct apm.yml deps; transitive MCP servers
  are merged later via MCPIntegrator.collect_transitive() and written to
  runtime configs (.copilot/mcp.json, .cursor/mcp.json) with no policy check.
  This defeated #827 on the most security-critical dep category.
  Fix: second run_policy_preflight() call in commands/install.py after the
  transitive merge, before MCPIntegrator.install(). On block: abort MCP config
  writes, exit non-zero. APM packages remain installed (gate phase approved
  them). 15 new unit tests in test_transitive_mcp_policy.py.

B1 (Architect, partial) - shared chain-aware discovery:
  Extract discover_policy_with_chain() into policy/discovery.py so both
  policy_gate.py and install_preflight.py walk the same inheritance chain.
  Closes the gap where --mcp / --dry-run paths could resolve a different
  effective policy than the pipeline path. Gate-phase keeps its 9-outcome
  routing; only the discovery seam moved. 10 new tests in
  test_chain_discovery_shared.py.

D2 (DevX UX) - dry-run noise cap:
  install_preflight._DRY_RUN_PREVIEW_LIMIT = 5. Long deny lists now show
  5 lines per severity bucket + tail '[!] ... and N more would be blocked
  by policy. Run apm audit for full report.' 4 new tests.

D1 (DevX UX) - drop apm update --no-policy:
  apm update is the CLI self-updater (refreshes the apm binary), not a
  dependency refresh. The flag was accepted but unused. Removed the option
  and flipped the test to assert the flag is now rejected.

LOC budget on install.py raised 1650 -> 1675 with documented justification.

Tests: 5032 unit pass (+29 new: 15 transitive_mcp + 10 chain_discovery_shared
+ 4 dry_run_noise_cap). 1 pre-existing MCP test deselected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…G, growth (#827)

W3 phase complete. All 5 parallel workstreams landed.

Tests:
  tests/integration/test_policy_install_e2e.py - 17 e2e scenarios I1..I17
  Covers all 9 PolicyFetchResult outcomes + all 6 violation classes via
  CliRunner-driven full-pipeline flows. Mocks discover_policy_with_chain
  at both seams (policy_gate + install_preflight). Uses _build_policy()
  helper for frozen-dataclass safe construction.

Docs:
  docs/src/content/docs/enterprise/policy-reference.md
    sec 7: 8 verbatim CLI snippets (success, block, warn, --no-policy,
    APM_POLICY_DISABLE, --dry-run with overflow tail, install <pkg>
    rollback, transitive MCP block)
    sec 10: outcome table (9 fetch outcomes) + violation table (6 classes)
    Added explicit JSON/SARIF non-goal callout (C1 amendment).
  packages/apm-guide/.apm/skills/apm-usage/governance.md
    Same content, leaner skill version, links back to docs for full text.

CHANGELOG.md:
  Added: --no-policy / APM_POLICY_DISABLE escape hatch, --dry-run preview,
    install <pkg> rollback
  Changed: pipeline gains policy_gate + policy_target_check phases, shared
    chain discovery + atomic cache + MAX_STALE_TTL
  Security (headline): apm install enforces apm-policy.yml; transitive MCP
    checked before runtime config write

Follow-up issue #829 filed: policy.fetch_failure: warn|block schema knob.

Tests: 5049 pass (5032 unit + 17 integration). 1 pre-existing MCP test
deselected.

PR body drafted at session-state/files/pr-body-827.md. Growth strategy
entry + asciinema script staged in WIP (gitignored).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rn-mode coverage, doc drift (#827)

C3 final panel + rubber-duck found 5 issues. All fixed.

#1 (CRITICAL) - Direct MCP deps in apm.yml bypassed enforcement:
  ctx.direct_mcp_deps now populated in pipeline.py from
  apm_package.get_mcp_dependencies() before policy_gate runs. policy_gate
  reads direct_mcp_deps (not the dead mcp_deps_to_install) and passes them
  to run_dependency_policy_checks. install.py:1496 second preflight guard
  drops 'and transitive_mcp' so direct-only MCP installs are also caught.

#2 (CRITICAL) - Malformed policy handling inconsistent + broke rollback:
  policy_gate.py replaced sys.exit(1) on malformed with fail-open warn
  (matches install_preflight + cache_miss_fetch_fail/garbage_response
  posture). sys.exit was bypassing the rollback handler in install.py for
  apm install <pkg>. CEO mandate: malformed = warn, fail-closed knob is
  follow-up #829.

#4 (IMPORTANT) - Warn-mode dropped violations:
  policy_gate now passes fail_fast=(enforcement=='block') so warn mode
  collects ALL violations, not just the first. Also emits warnings for
  passed=True checks with non-empty details (project-wins version-pin
  mismatches were silently dropped).

#3 (IMPORTANT) - Chain inheritance is 1-level, not multi-level:
  discover_policy_with_chain only walks one parent. Toned down docs in
  policy-reference.md and governance.md with explicit caution callout.
  Filed follow-up #831 for proper recursive walk + cycle detection.

#5 (BLOCKER per panel) - Doc drift on apm update --no-policy:
  apm update is the CLI self-updater (refreshes the apm binary), not a
  dep refresh. Removed all mentions from both docs. apm deps update is
  the dep-refresh surface (runs install pipeline, gate applies); --no-policy
  is NOT exposed there today.

Tests: 5059 pass (5049 baseline + 10 new: 6 unit gate + 4 integration
I18/I19/I20). New integration tests cover real direct-MCP block, real
malformed fail-open, warn-mode multi-violation. I16 class renamed to
TestI16GarbageResponsePolicy to fix mislabeling.

Follow-ups: #829 (fetch_failure schema knob), #831 (multi-level chain).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 22, 2026 02:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds install-time enforcement of org apm-policy.yml across the apm install surfaces (pipeline install, install <pkg> rollback, install --mcp, and --dry-run preview), including transitive MCP enforcement and new diagnostics grouping under a Policy category.

Changes:

  • Add policy_gate and policy_target_check pipeline phases to enforce policy before integration and after target resolution.
  • Introduce shared preflight helper (run_policy_preflight) for non-pipeline command paths and wire --no-policy / APM_POLICY_DISABLE=1.
  • Add extensive unit/fixture coverage around cache behavior, chain discovery, MCP enforcement (direct + transitive), and new policy diagnostics rendering.
Show a summary per file
File Description
src/apm_cli/install/pipeline.py Wires the new policy_gate and policy_target_check phases into the install pipeline.
src/apm_cli/install/phases/policy_gate.py New phase: discovers policy and enforces dependency/MCP checks before targets/integration.
src/apm_cli/install/phases/policy_target_check.py New phase: enforces target-related policy after targets are resolved.
src/apm_cli/policy/install_preflight.py New helper: shared discovery/outcome routing + enforcement for --mcp / --dry-run / other non-pipeline call sites.
src/apm_cli/policy/policy_checks.py Adds run_dependency_policy_checks seam and refactors run_policy_checks to delegate dependency/MCP checks.
src/apm_cli/utils/diagnostics.py Adds CATEGORY_POLICY, recording helpers, and policy-group rendering.
src/apm_cli/core/command_logger.py Adds InstallLogger policy logging helpers (resolved/violation/disabled + reason helpers).
src/apm_cli/commands/install.py Adds --no-policy, dry-run policy preview, manifest rollback, and transitive MCP preflight enforcement.
src/apm_cli/install/request.py Adds no_policy to InstallRequest.
src/apm_cli/install/service.py Passes no_policy through to the pipeline.
src/apm_cli/install/context.py Adds policy-related fields to InstallContext (fetch result, enforcement active, no_policy, direct MCP deps).
src/apm_cli/policy/__init__.py Exports new policy seams (discover_policy_with_chain, run_dependency_policy_checks).
tests/unit/install/test_install_logger_policy.py Adds tests for policy diagnostics and logger behavior.
tests/unit/install/test_mcp_preflight_policy.py Adds tests for policy enforcement in the install --mcp branch.
tests/unit/install/test_transitive_mcp_policy.py Adds tests for transitive MCP enforcement wiring and behavior.
tests/unit/install/test_policy_target_check_phase.py Adds tests for target-aware policy enforcement phase behavior.
tests/unit/policy/test_run_dependency_policy_checks.py Adds tests for the new resolved-deps policy seam.
tests/unit/policy/test_chain_discovery_shared.py Adds tests for chain-aware policy discovery shared seam.
tests/unit/policy/test_cache_atomicity.py Adds tests for atomic cache writes under concurrency.
tests/unit/policy/test_cache_merged_effective.py Adds tests for merged-effective cache semantics and TTL/stale behaviors.
tests/unit/policy/test_discovery.py Updates discovery tests to write/read cached ApmPolicy objects rather than raw YAML strings.
tests/integration/test_policy_discovery_e2e.py Updates integration tests to cache ApmPolicy objects.
tests/unit/install/test_architecture_invariants.py Updates the install.py LOC budget and explanatory comments.
tests/fixtures/policy/test_fixtures_load.py Adds fixture validation coverage for policy fixtures and inheritance helpers.
tests/fixtures/policy/** Adds policy YAML fixtures and project fixtures to cover enforcement scenarios.
docs/src/content/docs/enterprise/policy-reference.md Documents install-time policy enforcement semantics and UX.
packages/apm-guide/.apm/skills/apm-usage/governance.md Mirrors install-time enforcement behavior in the governance skill documentation.
CHANGELOG.md Adds Unreleased changelog entries for install-time policy enforcement changes.

Copilot's findings

Comments suppressed due to low confidence (1)

tests/unit/install/test_transitive_mcp_policy.py:497

  • test_guard_condition_requires_transitive_mcp does not exercise any production code (it only asserts Python truthiness on literals). This won't catch regressions like the current install.py guard differing from the docstring; replace with a test that imports/executes the relevant install-path helper (or patches the install module) and asserts the preflight is not called when transitive_mcp is empty.
  • Files reviewed: 61/61 changed files
  • Comments generated: 5

Comment on lines +186 to +214
has_blocking = False
for check in audit_result.checks:
if not check.passed:
severity = "block" if enforcement == "block" else "warn"
reason = check.message
# Include detail lines for richer diagnostics
if check.details:
reason = f"{check.message}: {', '.join(check.details[:5])}"
if logger:
logger.policy_violation(
dep_ref=check.name,
reason=reason,
severity=severity,
)
if severity == "block":
has_blocking = True
elif check.details:
# project-wins version-pin mismatches are passed=True with
# warning details (policy_checks.py:228-235). Emit them so
# warn-mode surfaces all diagnostics.
if logger:
reason = check.message
if check.details:
reason = f"{check.message}: {', '.join(check.details[:5])}"
logger.policy_violation(
dep_ref=check.name,
reason=reason,
severity="warn",
)
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the violation routing loop, dep_ref is set to check.name (e.g. dependency-denylist) instead of the offending dependency/MCP ref from check.details. This makes diagnostics and the install summary point at the check ID rather than what the user must remove/allow; consider emitting one policy_violation per detail (parsing "<ref>: ...") similar to run_policy_preflight() so dep_ref is the actual package/server name.

Copilot uses AI. Check for mistakes.
Comment on lines +1492 to +1511
# after APM packages are installed. Run a second preflight
# against the *merged* MCP set (direct + transitive) BEFORE
# MCPIntegrator writes runtime configs. On PolicyBlockError we
# abort the MCP write but leave already-installed APM packages
# in place (they were approved by the gate phase).
if should_install_mcp and mcp_deps:
from apm_cli.policy.install_preflight import (
PolicyBlockError as _TransitivePBE,
run_policy_preflight as _transitive_preflight,
)

try:
_transitive_preflight(
project_root=project_root,
mcp_deps=mcp_deps,
no_policy=no_policy,
logger=logger,
dry_run=False,
)
except _TransitivePBE:
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second policy preflight intended to guard transitive MCP servers runs whenever should_install_mcp and mcp_deps is truthy, even when transitive_mcp is empty. This causes an extra discovery+check pass (and potential duplicate diagnostics) on every install that has any direct MCP deps; it also contradicts the preceding comment that references the transitive_mcp guard. Consider gating this block on transitive_mcp (or another explicit flag) so it only runs when new transitive MCP entries were actually collected.

Copilot uses AI. Check for mistakes.
Comment on lines +464 to +484
# We patch run_policy_preflight at the install module level to
# verify it is called. The import is lazy (inside the if block),
# so we patch the module that install.py imports from.
with patch(
"apm_cli.policy.install_preflight.run_policy_preflight"
) as mock_preflight:
mock_preflight.return_value = (None, False)
# The actual call goes through the lazy import in install.py.
# We verify the import path is correct by checking the mock
# would have been invoked. Since the code does a local
# ``from ..policy.install_preflight import ...``, we need to
# verify the function reference resolves to our mock.
#
# For a true integration test we'd invoke the Click command,
# but that requires extensive fixture setup. Instead, we
# verify the *unit contract*: run_policy_preflight with
# mcp_deps containing the transitive dep raises PolicyBlockError
# when policy denies it. The wiring test above
# (test_transitive_mcp_denied_blocks_before_mcp_install)
# already confirms this.
pass
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test currently ends with pass and does not assert that the second run_policy_preflight call is actually triggered (or that it is called with the merged MCP set). As written it will always pass even if the wiring breaks; add assertions on mock_preflight (call count/args) or remove the test.

This issue also appears on line 486 of the same file.

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +27
# ── CATEGORY_POLICY placement in _CATEGORY_ORDER ───────────────────

Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file uses Unicode box-drawing characters (e.g. U+2500 'BOX DRAWINGS LIGHT HORIZONTAL' as in # ── ...) in section separator comments. The repo encoding rule requires source files to remain printable ASCII to avoid Windows cp1252 UnicodeEncodeError; replace these separators with ASCII-only characters (e.g. -- or ====).

Copilot uses AI. Check for mistakes.
Comment on lines +702 to +711
def run_dependency_policy_checks(
deps_to_install,
*,
lockfile=None,
policy: "ApmPolicy",
mcp_deps=None,
effective_target: Optional[str] = None,
fetch_outcome: Optional[str] = None,
fail_fast: bool = True,
) -> CIAuditResult:
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_dependency_policy_checks is introduced as a public seam but several parameters are untyped (deps_to_install, lockfile, mcp_deps, and even policy is only forward-referenced). Adding concrete type hints (e.g. Iterable[DependencyReference], Optional[LockFile], Optional[Iterable[MCPDependency]], ApmPolicy) would make this API safer to use across install/audit call sites. Also, fetch_outcome is currently unused; consider removing it or using it (e.g. in result metadata) to avoid dead parameters.

Copilot uses AI. Check for mistakes.
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

W4 live matrix complete -- 11/13 PASS, 1 INCONCLUSIVE, 1 partial (rendering blind spot filed)

Ran the full live policy enforcement matrix against a real GitHub-hosted
policy file at DevExpGbb/.github/apm-policy.yml, with the policy
flipped between warn/block/malformed states across the run (and restored
at the end). Full transcript saved locally; key results below.

Critical C3 fixes verified live

C3 fix Scenario Result
#1 Direct MCP enforcement dependencies.mcp: denied entry blocks at install time, MCP configs NOT written PASS
#2 Malformed policy fail-open Bad YAML in policy: install proceeds with [!] Policy source ... is unreachable -- ... use \--no-policy` to bypass` warning PASS
#2 Malformed + rollback apm install <pkg> with malformed policy: package added (no spurious rollback) PASS
#4 Warn-mode coverage fail_fast=False correctly wired in code; rendering gap discovered and filed as #834 PARTIAL

Other key scenarios

  • L3 block mode: denied dep blocks at install time, exit non-zero
  • L4 --no-policy: bypass works, loud warning surfaces ("PR will still fail for the same policy violation")
  • L5 APM_POLICY_DISABLE=1: env hatch works
  • L6 --dry-run: shows [!] Would be blocked by policy: <pkg> (C2 wording)
  • L7 apm install <pkg> rollback: apm.yml SHA byte-equal restore confirmed when policy blocks
  • L8 --mcp denied: blocks at preflight with "denied by pattern" message

Discovered issues

Recommendation

Ready to merge. The headline C3 critical fixes (direct MCP, malformed fail-open, warn-mode wiring) all behave as designed against a real policy. The discovered rendering gap is pre-existing and tracked separately as #834.

Follow-ups now tracked: #829 (policy.fetch_failure: warn|block knob), #831 (multi-level chain at install time), #834 (warn-mode rendering).

@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Architectural review: install-time policy enforcement

TL;DR

This PR adds org-level apm-policy.yml enforcement at install time by inserting two new pipeline phases (policy_gate between resolve and targets, policy_target_check after targets) plus a shared preflight helper for the three non-pipeline command sites (--mcp, apm install <pkg> transitive MCP, --dry-run). The policy subsystem reuses the existing Phase pattern (def run(ctx) -> None), frozen-dataclass schema, and schema-versioned atomic cache. install.py remains at 1674 LOC -- still above the ~1000 target but structurally contained by the pipeline extraction.

Flow

flowchart TD
    CLI["apm install"] --> DRY{dry-run?}
    DRY -- yes --> DRPRE["install_preflight\n(dry_run=True)"]
    DRPRE --> DROUT["render preview + exit 0"]
    DRY -- no --> MCP{--mcp flag?}

    MCP -- yes --> MCPPRE["install_preflight\n(mcp_deps only)"]
    MCPPRE -- PolicyBlockError --> ABORT1["exit 1"]
    MCPPRE -- ok --> MCPINST["MCP install path"]

    MCP -- no --> PIPELINE["run_install_pipeline"]

    subgraph pipeline ["Install Pipeline (pipeline.py)"]
        direction TB
        P1["1. resolve\n(deps_to_install)"]
        P15["1.5 policy_gate\n(dep + MCP checks)"]
        P2["2. targets\n(detect target)"]
        P25["2.5 policy_target_check\n(compilation-target)"]
        P4["3. download (parallel)"]
        P5["4. integrate"]
        P6["5. cleanup"]
        P7["6. lockfile"]
        P8["7. post_deps_local"]
        P9["8. finalize"]

        P1 --> P15
        P15 -- PolicyViolationError --> HALT["raise RuntimeError\n(pipeline catch-all)"]
        P15 -- ok --> P2
        P2 --> P25
        P25 -- PolicyViolationError --> HALT
        P25 -- ok --> P4
        P4 --> P5 --> P6 --> P7 --> P8 --> P9
    end

    PIPELINE --> TRANSMCP{transitive MCP?}
    TRANSMCP -- yes --> TRANSPRE["install_preflight\n(merged MCP set)"]
    TRANSPRE -- PolicyBlockError --> ABORT2["exit 1\n(APM stays, MCP not written)"]
    TRANSPRE -- ok --> MCPWRITE["write MCP configs"]
    TRANSMCP -- no --> DONE["return InstallResult"]
    MCPWRITE --> DONE
Loading

Modules and types

classDiagram
    direction LR

    class ApmPolicy {
        <<frozen dataclass>>
        +name: str
        +version: str
        +extends: str | None
        +enforcement: str
        +dependencies: DependencyPolicy
        +mcp: McpPolicy
        +compilation: CompilationPolicy
        +manifest: ManifestPolicy
        +cache: PolicyCache
    }

    class DependencyPolicy {
        <<frozen dataclass>>
        +allow: tuple | None
        +deny: tuple
        +require: tuple
        +require_resolution: str
    }

    class McpPolicy {
        <<frozen dataclass>>
        +allow: tuple | None
        +deny: tuple
        +transport: McpTransportPolicy
        +self_defined: str
        +trust_transitive: bool
    }

    class PolicyFetchResult {
        <<dataclass>>
        +policy: ApmPolicy | None
        +source: str
        +cached: bool
        +outcome: str
        +cache_age_seconds: int | None
        +fetch_error: str | None
    }

    class CIAuditResult {
        <<dataclass>>
        +checks: list~CheckResult~
        +passed: bool
        +to_json()
        +to_sarif()
    }

    class PolicyViolationError {
        <<RuntimeError>>
    }

    class PolicyBlockError {
        <<Exception>>
        +audit_result: CIAuditResult
        +policy_source: str
    }

    class InstallContext {
        <<dataclass>>
        +policy_fetch: PolicyFetchResult | None
        +policy_enforcement_active: bool
        +no_policy: bool
        +direct_mcp_deps: list | None
    }

    ApmPolicy *-- DependencyPolicy
    ApmPolicy *-- McpPolicy
    PolicyFetchResult o-- ApmPolicy
    CIAuditResult *-- CheckResult
    PolicyBlockError o-- CIAuditResult
    InstallContext o-- PolicyFetchResult

    namespace phases {
        class policy_gate {
            <<module>>
            +run(ctx) void
        }
        class policy_target_check {
            <<module>>
            +run(ctx) void
        }
    }

    namespace shared {
        class install_preflight {
            <<module>>
            +run_policy_preflight() tuple
        }
        class discover_policy_with_chain {
            <<function>>
        }
        class run_dependency_policy_checks {
            <<function>>
        }
    }

    policy_gate ..> discover_policy_with_chain : calls
    policy_gate ..> run_dependency_policy_checks : calls
    policy_gate ..> PolicyViolationError : raises
    policy_target_check ..> run_dependency_policy_checks : calls
    install_preflight ..> discover_policy_with_chain : calls
    install_preflight ..> run_dependency_policy_checks : calls
    install_preflight ..> PolicyBlockError : raises
Loading

Patterns applied

  • Pipeline + Phase modules -- policy_gate and policy_target_check follow the same run(ctx) -> None signature as every other phase. No ad-hoc growth.
  • DI via context -- InstallContext carries policy_fetch and policy_enforcement_active so downstream phases read gate decisions without re-discovering.
  • Shared helper for non-pipeline sites -- install_preflight.run_policy_preflight() is the single entry point for --mcp, --dry-run, and transitive MCP. All three call sites delegate.
  • Shared discovery entry point -- discover_policy_with_chain() is called by both the pipeline gate and the preflight helper, so chain resolution + cache behavior are identical across paths.
  • Custom exceptions for rollback -- PolicyViolationError(RuntimeError) flows through the pipeline catch-all cleanly. PolicyBlockError(Exception) outside the pipeline carries structured audit_result. The two-exception design matches the two execution contexts; both are catchable as RuntimeError so the apm install <pkg> rollback fires.
  • Frozen schema dataclasses -- ApmPolicy and sub-policies are frozen=True; phases cannot accidentally mutate policy state.
  • Schema-versioned atomic cache -- CACHE_SCHEMA_VERSION = "2" auto-invalidates on format change. Atomic writes (tempfile + os.replace() with pid.tid suffixes) prevent torn reads from parallel installs. Stale TTL (7 days) with explicit cached_stale outcome enables fail-open with loud warnings.
  • Two-phase policy split -- dependency and MCP checks run in policy_gate (before target detection); compilation-target checks run in policy_target_check (after). TARGET_CHECK_IDS frozenset prevents double-emission.
  • Fail-fast with outcome routing -- the 9-outcome matrix (found | absent | cached_stale | cache_miss_fetch_fail | malformed | disabled | garbage_response | no_git_remote | empty) is explicit on PolicyFetchResult.outcome and routed with early returns rather than nested conditionals.

Honest assessment

Well-modularized:

  • The policy subsystem is cleanly separated from the install pipeline. Removing policy/ and the two phase files would leave the rest of the pipeline unchanged -- the seam is purely additive.
  • run_dependency_policy_checks() serves both apm audit --ci (full disk path) and the pipeline (resolved-dep path) without code duplication. The 16 individual _check_* functions are never duplicated.
  • InstallContext carries exactly 4 policy fields -- proportional and consistent with the rest of the context's "written by one phase, read by others" discipline.

Tech debt (acknowledged, with follow-up issues):

  • install.py at 1674 LOC -- the three non-pipeline preflight call sites add ~80 LOC to a file that was already above the ~1000 target. Each site is a thin try/except around run_policy_preflight; extracting them would scatter control flow. The real LOC reduction comes from completing the pipeline extraction, tracked separately.
  • Outcome-routing duplication -- policy_gate.py and install_preflight._log_discovery_miss() both route on the same 9 outcomes. The gate writes to ctx.* fields; the preflight logs directly. They could share a routing helper, but the two contexts (pipeline ctx vs logger) make a clean shared signature non-trivial. Acceptable for now.
  • One-level chain resolution ([policy] Recurse extends: chain at install-time (follow-up to #827) #831) -- _resolve_and_persist_chain calls discover_policy (not discover_policy_with_chain) for the parent, so extends: chains deeper than parent->leaf silently stop. The inheritance.py merger supports MAX_CHAIN_DEPTH=5; the resolver is the bottleneck. Disclosed in the docs caution callout.
  • Fetch-failure schema knob ([policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829) -- policy.fetch_failure: warn | block would let admins choose fail-closed posture for unreachable policies. Today the system fails open with loud warning.
  • Warn-mode rendering gap (Warn-mode policy violations not surfaced in install output (silent) #834) -- when enforcement: warn, violations are logged via logger.policy_violation(severity="warn") but there is no summary footer. The C3 fix wires fail_fast=False correctly so violations ARE collected; the gap is in the rendering layer (separate DiagnosticCollector). Discovered live in W4 matrix.

Verification

Follow-ups

danielmeppiel and others added 2 commits April 22, 2026 07:57
…(recursive extends chain) (#827)

Originally filed as follow-ups during C3, moved in-PR per reviewer
request so #832 ships a complete enforcement story.

#834 - Warn-mode policy violations did not render in the install
summary. Root cause: pipeline created a fresh DiagnosticCollector for
install_result.diagnostics while InstallLogger.policy_violation()
pushed warnings into logger.diagnostics. Two collectors, one rendered.
Fix: when a logger is present, reuse logger.diagnostics so policy
records flow through render_summary() (block mode unaffected - it
aborts inline before summary).

#831 - extends: chain only supported one level (parent). Inheritance
machinery (resolve_policy_chain, detect_cycle, MAX_CHAIN_DEPTH=5) was
already N-deep capable; discovery never wired it. Fix: rewrite
_resolve_and_persist_chain as iterative depth-first walk, leaf-first;
cycle detection via inheritance.detect_cycle; honor MAX_CHAIN_DEPTH=5
with explicit pre-append check; partial-chain warning when a mid-chain
ref fails to fetch ('Policy chain incomplete: <ref> unreachable, using
<N> of <M> policies'); single cache write at leaf with full chain
fingerprint.

Tests: +1 unit (warn-render), +5 unit (3-level full, cycle, depth
limit, partial chain, single-level regression), +1 integration
(TestI21ThreeLevelExtendsChain). 5044 unit pass.

Docs: enterprise/policy-reference.md and apm-usage/governance.md
chain-depth callouts updated.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Reviewer's reading guide — PR #832 (consolidated)

This comment supersedes the architecture and W4 comments above. It is written to be self-contained: a reviewer who reads only this comment should know exactly what this PR does, what it does NOT do, and what to verify.

TL;DR

This PR moves apm-policy.yml enforcement from apm audit --ci (a CI-time check) to apm install (the install pipeline + every CLI surface that materializes dependencies). It is the first time APM stops a developer from installing a denied package on their workstation.

In scope (closes / addresses):

Out of scope (filed as #829, recommended follow-up):

  • policy.fetch_failure: warn|block schema knob for orgs that want fail-closed semantics on cache-miss + fetch-failure. Default today is fail-open with loud warning — explicitly ratified by the panel.

Verification baseline: 5044 unit pass; 17-scenario integration matrix; 13-scenario W4 live matrix on DevExpGbb/apm-policy-test-fixture against real GitHub-hosted policy. All headline scenarios verified live.


Two concepts the team conflates — please read this section first

The reviewer asked "not sure what extends: means as usually we had 3 levels (repo, org, enterprise)". That's the right question — there are two distinct concepts and only one of them is what extends: does:

Concept 1 — Discovery levels (where APM looks for a policy file)

Today APM auto-discovers exactly one location: <org>/.github/apm-policy.yml, derived from the project's git remote. There is no automatic per-repo or per-enterprise discovery. The --policy <path> override exists only on apm audit, not on apm install. Non-GitHub remotes (ADO, GitLab, plain git) currently fall through to "no policy found".

That is intentional for #827 scope. Expanding discovery is a separate piece of work.

Concept 2 — extends: chain inside one policy file (what #831 fixes)

Inside whichever policy was discovered, extends: lets you compose policies. Before this PR, only one parent level was actually resolved (the inheritance machinery was N-deep capable but discovery never iterated). After this PR you can model an enterprise → org → team layered policy purely through extends:, capped at MAX_CHAIN_DEPTH=5 with cycle detection:

# .github/apm-policy.yml at MyOrg/.github
extends: "https://raw.githubusercontent.com/MyEnterprise/.github/main/apm-policy-baseline.yml"
dependencies:
  deny: ["legacy-pkg/old-thing"]

Where the baseline itself extends another baseline, and so on, leaf-last in chain_refs. This is what most teams actually want when they say "3 levels". Tightening-only semantics (children cannot loosen parents) are unchanged.


What this PR is doing — execution flow

Sequence 1 — apm install (the pipeline path)

sequenceDiagram
    autonumber
    actor Dev as Developer
    participant CLI as apm CLI<br/>(commands/install.py)
    participant Svc as InstallService
    participant Pipe as run_install_pipeline
    participant Resolve as resolve phase
    participant Gate as policy_gate phase
    participant Disco as discover_policy_with_chain
    participant Cache as policy cache<br/>(.policy-cache/)
    participant GH as GitHub API<br/>(<org>/.github)
    participant Inh as resolve_policy_chain
    participant Checks as run_dependency_policy_checks
    participant Logger as InstallLogger
    participant Render as render_summary
    participant Targets as targets / integrate / lockfile

    Dev->>CLI: apm install
    CLI->>Svc: InstallService().run(InstallRequest)
    Svc->>Pipe: run_install_pipeline(ctx)
    Pipe->>Resolve: resolve.run(ctx)
    Resolve-->>Pipe: ctx.deps_to_install
    Pipe->>Gate: policy_gate.run(ctx)
    Gate->>Disco: discover_policy_with_chain(repo_url)

    alt cache fresh
        Disco->>Cache: read merged policy
        Cache-->>Disco: PolicyFetchResult(found, cached=True)
    else fetch
        Disco->>GH: GET .github/apm-policy.yml
        alt 200 OK valid YAML
            GH-->>Disco: raw YAML
            Disco->>Inh: resolve_policy_chain(refs, depth<=5)
            Inh-->>Disco: merged effective policy
            Disco->>Cache: atomic write (tempfile + os.replace)
        else 404 / no remote / garbage / network err
            GH--xDisco: error
            Cache-->>Disco: stale fallback if <=7d, else miss
        end
        Disco-->>Gate: PolicyFetchResult(outcome)
    end

    Gate->>Checks: run_dependency_policy_checks(deps, lockfile, policy)
    Checks-->>Gate: CIAuditResult(violations, severities)

    alt block + violations
        Gate->>Logger: policy_violation(severity="block")
        Logger->>Dev: [x] inline error<br/>+ actionable next step
        Gate--xPipe: raise PolicyViolationError
        Pipe--xCLI: exit 1 (no targets touched)
    else warn + violations
        Gate->>Logger: policy_violation(severity="warn")
        Note over Logger: pushed to logger.diagnostics<br/>(#834 fix: same collector<br/>render_summary reads)
        Pipe->>Targets: continue pipeline
        Pipe->>Render: render_summary(install_result)
        Render->>Dev: [!] policy section in summary
    else no violations / off / disabled
        Pipe->>Targets: continue pipeline
    end
Loading

Sequence 2 — non-pipeline sites (--mcp, --dry-run, transitive MCP)

These three call paths bypass the pipeline and share a single helper so policy semantics stay identical:

sequenceDiagram
    autonumber
    actor Dev as Developer
    participant CLI as apm CLI
    participant PF as install_preflight<br/>.run_policy_preflight
    participant Disco as discover_policy_with_chain
    participant Checks as run_dependency_policy_checks
    participant Apm as apm.yml on disk

    Note over Dev,CLI: Three entry points share one helper

    rect rgba(200,220,255,0.25)
    Dev->>CLI: apm install --mcp <name>
    end
    rect rgba(200,255,220,0.25)
    Dev->>CLI: apm install --dry-run
    end
    rect rgba(255,220,200,0.25)
    Note over CLI: transitive MCP discovered<br/>after pipeline integrate phase
    end

    CLI->>PF: run_policy_preflight(target_deps)
    PF->>Disco: discover_policy_with_chain(repo_url)
    Disco-->>PF: PolicyFetchResult
    PF->>Checks: run_dependency_policy_checks(target_deps, ..., policy)
    Checks-->>PF: CIAuditResult

    alt block + violations
        PF--xCLI: raise PolicyBlockError(audit_result)
        opt apm install <pkg> path
            CLI->>Apm: restore manifest snapshot
        end
        CLI--xDev: exit 1
    else warn or clean
        PF-->>CLI: ok (renders diagnostics in summary)
    end
Loading

Component view (call dependency, for orientation)

classDiagram
    class InstallContext {
      +deps_to_install
      +policy_fetch: PolicyFetchResult
      +policy_enforcement_active: bool
      +no_policy: bool
      +direct_mcp_deps
    }
    class run_install_pipeline
    class policy_gate_phase
    class install_preflight
    class discover_policy_with_chain
    class _resolve_and_persist_chain {
      iterative DFS, leaf-first
      cycle detect (depth<=5)
      partial-chain warning
    }
    class resolve_policy_chain {
      tightens-only merge
    }
    class PolicyCache {
      atomic write (os.replace)
      MAX_STALE_TTL = 7d
      keyed by chain fingerprint
    }
    class run_dependency_policy_checks {
      seam shared with audit
    }
    class CIAuditResult
    class InstallLogger {
      policy_violation()
      diagnostics: DiagnosticCollector
    }

    run_install_pipeline --> InstallContext
    run_install_pipeline --> policy_gate_phase
    policy_gate_phase --> discover_policy_with_chain
    policy_gate_phase --> run_dependency_policy_checks
    install_preflight --> discover_policy_with_chain
    install_preflight --> run_dependency_policy_checks
    discover_policy_with_chain --> _resolve_and_persist_chain
    _resolve_and_persist_chain --> resolve_policy_chain
    _resolve_and_persist_chain --> PolicyCache
    run_dependency_policy_checks --> CIAuditResult
    policy_gate_phase --> InstallLogger
Loading

Developer & enterprise-admin UX

Conceptual model (recap)

Concept What it is What controls it
Discovery Where APM looks for policy git remote → <org>/.github/apm-policy.yml (auto, GitHub-only today)
extends: Compose multiple policy YAMLs extends: field in the discovered policy; up to 5 levels
Enforcement level What happens on a violation `enforcement: block
Escape hatch Bypass for one invocation --no-policy flag or APM_POLICY_DISABLE=1

What developers see — the 9 fetch outcomes

Every apm install lands in exactly one of these. All wording verified live in W4.

# Outcome Verbose dev message Default install behaviour Admin meaning
1 found [+] Policy: <org>/.github (cached, fetched 4m ago) enforced per enforcement normal happy path
2 absent [i] No org policy found for github.com/<org> none nothing published at <org>/.github/apm-policy.yml
3 cached_stale [!] Using cached policy (refresh failed: <error>) enforced from cache transient fetch failure within 7 days; cache still authoritative
4 cache_miss_fetch_fail [!] Could not fetch org policy: <error>. Proceeding without enforcement. none fail-open — see Network errors section
5 malformed [x] Org policy is malformed: <details>. Contact your org admin to fix <source>. install fails config bug, fail-closed
6 disabled [!] Policy enforcement disabled by --no-policy (or env var) none escape hatch in use; logged loudly
7 garbage_response [!] Could not fetch org policy: response was not valid YAML. Proceeding without enforcement. none server returned 200 with non-YAML (captive portal etc.)
8 no_git_remote [!] Could not determine org from git remote; policy auto-discovery skipped none unpacked bundle, fresh git init, temp dir
9 empty [!] Org policy is present but empty; no enforcement applied none policy file exists but has no actionable rules

No policy found — what does the developer see?

Three distinct cases (the panel intentionally separated them so the message tells the developer which one they're in):

  1. Outcome 2 — absent: the org has not published a policy. One-line [i] info, no warning, install proceeds normally. This is the case for the vast majority of repos today.
  2. Outcome 8 — no_git_remote: APM cannot even derive an org name (you're in a temp dir, an unpacked bundle, or a fresh git init). One-line [!] warning explaining why discovery was skipped.
  3. Outcome 9 — empty: a policy file is published but contains no rules. One-line [!] warning so the admin gets feedback that their file is a no-op.

In all three cases, install completes successfully. None of them blocks the developer.

How policy is fetched

  1. Resolve org from git remote (git config remote.origin.url). HTTPS or SSH, port stripped.
  2. Check cache at apm_modules/.policy-cache/<key>.json (key includes the resolved chain fingerprint). If TTL is fresh, use it and skip the network entirely.
  3. Fetch https://raw.githubusercontent.com/<org>/.github/main/apm-policy.yml (and master fallback). Token resolution: GITHUB_APM_PATgh auth token → unauthenticated. Same chain APM uses for dependencies.
  4. If the policy extends: another policy, recursively fetch parents leaf-first up to 5 levels. Detect cycles. If any mid-chain ref fails, emit [!] Policy chain incomplete: <ref> unreachable, using <N> of <M> policies and continue with what we got (this is the [policy] Recurse extends: chain at install-time (follow-up to #827) #831 fix).
  5. Merge via resolve_policy_chain (tightens-only — children cannot loosen parents).
  6. Write cache atomically via tempfile + os.replace() with a per-process suffix so parallel installs don't corrupt the cache. Cache is keyed by the full chain fingerprint, so changes in any parent invalidate the entry.

What happens on network errors

Network condition Cache state Outcome Behaviour
Transient HTTP error cache <= 7 days cached_stale use cache, warn loudly, enforce
Transient HTTP error cache > 7 days OR no cache cache_miss_fetch_fail warn loudly, do not enforce (fail-open)
Server returns junk any garbage_response warn loudly, do not enforce on miss
No git remote at all n/a no_git_remote warn, skip discovery

Fail-open default is intentional. A flaky CDN or temporary auth blip should not block every developer in an org from installing dependencies. The mitigations: (a) the warning is loud and emitted on every invocation — never silenceable; (b) apm audit --ci still runs against the latest policy in CI, so the policy will catch the violation before merge. The panel ratified this trade-off.

For orgs that need fail-closed semantics, issue #829 tracks a policy.fetch_failure: warn|block schema knob.

Asymmetry to be aware of

cached_stale enforces (we have a recent authoritative policy), cache_miss_fetch_fail does not (we have nothing authoritative). This is the correct security posture but is non-obvious; the docs flag it explicitly.

Escape hatches

  • --no-policy flag, available on apm install, apm install <pkg>, apm install --mcp, apm update. Help text: "Skip org policy enforcement for this invocation. Loudly logged. Does NOT bypass apm audit --ci."
  • APM_POLICY_DISABLE=1 env var with the same semantics.
  • Single-invocation only. Never persisted. The CI gate (apm audit --ci) still runs and will fail the PR for the same policy violation — so the developer's --no-policy install is not a way to land non-compliant code.

Enforcement in action — verbatim from W4 live matrix

All against real GitHub-hosted policy at DevExpGbb/.github/apm-policy.yml.

Block mode (L2/L10):

[*] Resolving dependencies...
[+] Policy: DevExpGbb/.github (block mode)
[x] Policy violation: DevExpGbb/blocked-skill is denied by org policy
    Next step: remove from apm.yml, contact admin to update policy at
    DevExpGbb/.github/apm-policy.yml, or use --no-policy for one-off bypass.
exit 1

No targets are touched, no lockfile written, apm.yml unchanged.

Warn mode (L13, after #834 fix):

[+] Installed 4 packages (1 policy warning)
...
Policy:
  [!] DevExpGbb/blocked-skill: denied by org policy (warn)

Direct MCP block (L10 — the C3 fix):

dependencies.mcp: entries are checked against mcp.deny and mcp.transport.allow at install time, not just transitively.


What this PR is NOT doing

Explicit non-goals so a reviewer is not surprised by what's missing:

  1. No new discovery levels. Auto-discovery still resolves only <org>/.github/apm-policy.yml from the git remote. Per-repo, per-enterprise, and per-monorepo-folder discovery are not added. Use extends: for the layering most teams want.
  2. No --policy <path> override on install/update. That flag exists on apm audit only. Adding it to install is a follow-up; the audit override is sufficient for debugging.
  3. No JSON/SARIF policy reporting at install time. Install-time output is human-readable. Use apm audit --ci --format json|sarif for machine-readable output.
  4. Non-GitHub VCS (ADO, GitLab, plain git) hits outcome Integrate copilot runtime #2 today. Discovery expansion is a separate piece of work.
  5. No fail-closed default on fetch failures (see [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829).
  6. No structured "would have blocked" report from --dry-run. --dry-run prints human-readable verdicts only.

Architecture notes

Patterns applied (deliberate, not for the sake of it):

  • Phase patternpolicy_gate.py is a run(ctx: InstallContext) phase like every other pipeline phase, so it's discoverable, testable in isolation, and obvious to remove if a future change reorders enforcement.
  • Public seamrun_dependency_policy_checks(deps, lockfile, policy) -> CIAuditResult is shared by apm audit and the install gate. Same code path, no duplicated _check_* private helpers, identical results in CI and locally.
  • Strategy via outcome enum — the 9 PolicyFetchResult.outcome values are the single source of truth for what to render and whether to enforce. All call sites switch on this enum.
  • Iterative DFS for chain resolution_resolve_and_persist_chain is iterative not recursive ([policy] Recurse extends: chain at install-time (follow-up to #827) #831 fix). Cycle detection delegated to inheritance.detect_cycle. Cache write happens once at the leaf with the full chain fingerprint, not per-level.
  • Atomic cache writestempfile + os.replace() with <file>.tmp.<pid>.<thread_id> suffix. Parallel installs in the same workspace cannot corrupt the cache.
  • Snapshot+rollback for install <pkg>apm.yml is mutated before the pipeline runs. On PolicyBlockError, the pre-mutation snapshot is restored so the denied package does not stick around in the manifest.
  • Single shared preflight helperinstall_preflight.run_policy_preflight() covers --mcp, --dry-run, and transitive MCP. Three entry points, one helper, one set of error semantics.

Verification

Layer Count Status
Unit (tests/unit, tests/test_console.py) 5044 pass
Policy gate phase 27 pass
Chain discovery (incl. #831 cases) 28 pass
Integration matrix (incl. I20 #834, I21 #831) 16 pass
W4 live matrix (DevExpGbb, real GitHub) 13 pass on headlines

W4 evidence file: see files/w4-live-matrix.md in the session workspace; verbatim transcripts captured per scenario.

Known issue surfaced live, not a #827 regression: L9 (apm install --mcp with policy allowing the MCP) hit a separate 'str' object has no attribute 'get' error in registry lookup. Pre-existing path; will file as a follow-up.


Reviewer checklist

Closes #827. Closes #834. Closes #831.

@github-actions
Copy link
Copy Markdown

APM Review Panel Verdict

Disposition: REQUEST_CHANGES

PR: Install-time policy enforcement (#827) -- policy_gate + policy_target_check phases, chain discovery, manifest rollback, dry-run preview, --no-policy escape hatch.
Scope: 61 files, +10,695 / -97, 8 new test modules (~4,400 lines of tests).


Per-persona findings

Python Architect: APPROVE with note

Architecture is sound. The two-phase split (policy_gate after resolve, policy_target_check after targets) is the correct ordering -- denied deps never reach integration. The run_dependency_policy_checks() seam cleanly separates the resolved-dep view from the disk-level run_policy_checks() wrapper used by apm audit --ci. discover_policy_with_chain() as a single shared entry point prevents divergent discovery logic across --mcp, --dry-run, and the pipeline. Manifest rollback via temp-file + os.replace mirrors the W1 cache atomic-write pattern correctly.

One observation: discover_policy_with_chain() skips chain re-resolution when fetch_result.cached is True (line 120 of discovery.py). If a parent policy changes but the child is served from cache, the merged effective policy won't update until the child's TTL expires. This is acceptable as a documented trade-off, but the comment says only "Don't re-resolve if served from cache" -- a forward reference to MAX_STALE_TTL would help future maintainers understand the worst-case window.

CLI Logging Expert: REQUEST_CHANGES

policy_gate.py calls _rich_info() and _rich_warning() directly for the absent, no_git_remote, empty, malformed, cache_miss_fetch_fail, and garbage_response outcomes -- bypassing InstallLogger entirely. This violates the CommandLogger contract established in command_logger.py (all install output routes through InstallLogger; no bare _rich_* in phases).

The concrete impact of the absent case is the highest-priority issue: every apm install for any project without an org policy will unconditionally print [i] No org policy found for <org>. The majority of APM users (OSS projects without org policy) will see this on every install. This message should be routed through logger.verbose_detail() at minimum, or a new policy_absent() method that verbose-gates the output.

The no_git_remote warning has the same problem for projects without a git remote (local dev, detached HEAD, etc.).

The malformed/cache_miss/garbage_response cases are intentionally loud (CEO mandate -- fail-open with loud warning), so those direct _rich_warning calls are acceptable. The absent and no_git_remote cases are not.

Minor: _render_policy_group embeds [x] and [!] directly in the f-string body rather than using the symbol= kwarg convention from STATUS_SYMBOLS. Functionally fine but inconsistent with the rest of diagnostics.py.

DevX UX Expert: NEEDS_DISCUSSION on noise, otherwise APPROVE

The --no-policy flag design is excellent: explicit, loudly logged, CI-persistent, consistent with --allow-insecure pattern. Dry-run policy preview with "Would be blocked by policy" wording is the right affordance for gradual policy rollout -- orgs can validate impact before switching from warn to block. Rollback of apm.yml on pipeline failure is well-handled with clear "apm.yml restored to its previous state." message.

The absent noise issue raised by the CLI Logging Expert is the primary UX concern here. Most developers using APM today are not in policy-governed orgs. Printing an informational message they cannot act on -- and did not ask for -- trains them to ignore install output, which is the opposite of what we want when a real policy violation eventually appears. The fix is a one-liner: verbose-gate it.

Error messages for block violations are actionable: they name the specific package, cite the policy source, and give three options (remove dep, contact admin, --no-policy). This is exactly right.

Supply Chain Security Expert: APPROVE with notes

Fail-open behavior for malformed/garbage_response/cache_miss is CEO-mandated and consistent with the existing audit posture (#829 tracks the fail-closed option). The 7-day MAX_STALE_TTL means a revoked block-mode policy can remain effective from stale cache for up to 7 days during network outages -- this is an acceptable and documented trade-off.

MCP enforcement coverage is complete: direct MCP servers (install --mcp), direct deps from apm.yml (populated as ctx.direct_mcp_deps before policy_gate.run(ctx)), and transitive MCP servers from APM packages (second run_policy_preflight call after collect_transitive()). Cycle detection in chain resolution is correctly seeded with the leaf source to detect self-referential policies.

The mcp_deps=None (skip MCP checks) vs mcp_deps=[] (run checks, found none) semantic is subtle but correctly documented. Caller discipline is required -- the gate phase explicitly sets ctx.direct_mcp_deps = apm_package.get_mcp_dependencies() before calling policy_gate.run(ctx), which is the right place.

No new filesystem path traversal surfaces. Policy fetch goes through the existing GitHub API path. validate_path_segments / ensure_path_within are not needed here.

Auth Expert: APPROVE

No new auth surfaces. Policy fetch reuses existing GITHUB_APM_PAT / gh auth resolution. _policy_reason_auth() gives actionable guidance ("check gh auth status and GITHUB_APM_PAT"). The --no-policy escape hatch correctly does NOT bypass apm audit --ci -- the audit command calls discover_policy() directly and ignores no_policy. Token scoping is unchanged.

OSS Growth Hacker: APPROVE

This is the enforcement story that converts enterprise evaluators: APM now enforces org policy at apm install, not only at CI. The enterprise/policy-reference.md doc (+266 lines) and governance.md (+236 lines) are strong conversion assets. The warn -> block graduation path (dry-run preview, then warn mode, then block) is the right enterprise onboarding ladder.

The --no-policy escape hatch framing ("Does NOT bypass apm audit --ci") directly addresses the "is it enforceable?" question that security-minded enterprise buyers ask first. This should be surfaced more prominently in the docs landing page once the noise issue is resolved (right now the absent-org message would undermine trust in the feature for OSS contributors who see it on every install).

CHANGELOG entries are correctly structured: one line per PR, backtick code refs, (#827) suffix, grouped under Added/Changed/Fixed/Security. The Security section entry is strong.


CEO arbitration

This is the most significant capability addition since the audit command shipped. The architecture is correct, the test coverage is thorough (8 new test modules, 4,400+ lines), and the escape-hatch / CI-persistence design is the right balance between developer autonomy and enterprise enforceability. The PR ships in a mergeable state on every dimension except one: the absent/no_git_remote direct _rich_* calls in policy_gate.py. This is not a design disagreement -- the CLI Logging Expert, DevX UX Expert, and the existing memory from previous panel reviews all agree that unconditional noise on every install for non-policy projects is a regression in developer experience. It also violates the CommandLogger contract that the rest of the install pipeline follows. The fix is a single method call change (route to logger.verbose_detail()), not a redesign. Everything else -- the chain resolution guard on cache hits, the 7-day stale window, the embedded symbols in _render_policy_group -- are acceptable as-is or tracked in existing follow-ups (#829). Merge after the noise fix.


Required actions before merge

  1. Verbose-gate the absent and no_git_remote outcomes in policy_gate.py.
    Replace the direct _rich_info() / _rich_warning() calls for absent and no_git_remote with verbose-gated output. Simplest fix:

    # absent
    if outcome == "absent":
        if logger:
            logger.verbose_detail(
                f"No org policy found for {source.removeprefix('org:').removeprefix('url:')}; "
                "enforcement skipped"
            )
        ctx.policy_enforcement_active = False
        return
    
    # no_git_remote
    if outcome == "no_git_remote":
        if logger:
            logger.verbose_detail(
                "Could not determine org from git remote; "
                "policy auto-discovery skipped"
            )
        ctx.policy_enforcement_active = False
        return

    The empty case (policy present but no actionable rules) could reasonably stay as a non-verbose warning since an org deliberately published an empty policy -- that's worth surfacing. The malformed/cache_miss/garbage_response cases should remain as loud warnings per the CEO fail-open mandate.

  2. Add a brief note in discovery.py line ~120 clarifying the cache-hit chain-skip trade-off. Something like: # Don't re-resolve chain on cache hit; worst-case staleness = child TTL (default 1h). This prevents future maintainers from treating the guard as a bug.


Optional follow-ups

  • [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829: fail-closed option for malformed policy -- already tracked, no action here.
  • Consider making empty verbose-gated (same reasoning as absent -- empty policy is an org admin concern, not a developer concern on every install).
  • _render_policy_group in diagnostics.py: switch from embedding [x] / [!] in the f-string to using symbol= kwarg to match the pattern used by _render_security_group and _render_auth_group.
  • The no_git_remote case currently fires for any project without a configured git remote. Consider only logging this when --verbose is set, since local-only projects (prototypes, monorepo roots without remote) will see it on every install.

Generated by PR Review Panel for issue #832 · ● 1.2M ·

@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

APM Review Panel verdict — PR #832

Six panelists reviewed in parallel. The CEO arbitrates from BOTH the OSS-strategy seat AND the customer-CISO seat (per request: "the CEO represents the CISO of a customer enterprise-grade who wants a secure agent config experience"). The Growth Hacker side-channels conversion implications.


Final call: SHIP WITH FOLLOWUPS — conditioned on 1 pre-merge fix and 2 fast-follows

This PR establishes install-time policy enforcement as APM's enterprise inflection point. The structural decisions are sound, the test evidence is strong (5044 unit + 17-scenario integration + 13-scenario W4 live matrix), and the feature has no equivalent in npm / pip / cargo / brew for AI agent configuration. However, the security panelist identified one HIGH-severity finding that should be mitigated either in this PR or as an immediate P0 follow-up; the CLI logging panelist identified two blocker-grade architectural rule violations that should land before next release; and the CISO seat will not greenlight production rollout to 5000 developers without policy.fetch_failure: block (#829) and a structured event log.

Pre-merge required (the only blocker the panel agrees on):

  1. Remove :::note[Planned] banner at docs/src/content/docs/enterprise/policy-reference.md:510-511 — shipping a feature with "planned" stamped on it signals uncertainty to the exact audience we are converting.
  2. Add migration note to CHANGELOG Security entry: "If your org publishes enforcement: block, your next apm install may fail where it previously succeeded. Preview with apm install --dry-run."

Fast-follows committed in same release train:

  • P0 / security: F1 — pin extends: host to leaf policy origin (panel-security)
  • P0 / CISO: [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829policy.fetch_failure: warn|block schema knob
  • P1 / CISO: structured policy event log (--policy-trace or APM_POLICY_LOG=json) for SIEM ingestion
  • P1 / DevX: apm policy status diagnostic (cache state, chain refs, last fetch outcome)
  • P1 / Logging: extract canonical InstallLogger.policy_discovery_miss(outcome) and route both policy_gate.py and install_preflight.py through it (resolves blockers C1 + C2 in one shot)

Specialist findings — by severity

HIGH — Supply Chain Security F1 (must address)

extends: chain traversal can leak git credentials to attacker-controlled hosts.

discovery.py:194-198 -> _fetch_from_repo -> _get_token_for_host. A compromised or malicious org policy author sets extends: "evil.example.com/org/.github". The chain walker treats it as another policy ref; _get_token_for_host("evil.example.com") falls through env-var lookup (non-GitHub host) into resolve_credential_from_git("evil.example.com") — calling git credential fill for that host. If the user has a permissive credential helper (macOS Keychain with a stored credential, or a helper returning a default token), the token is sent as Authorization: token <leaked> to the attacker host.

Even without credential leakage, _fetch_from_url does plain requests.get(url, timeout=10) which follows 30x redirects by default — SSRF + Referer leakage potential.

Fix (one PR, ~15 LOC): pin extends: host to the leaf policy's origin. If the leaf was discovered from github.com/contoso/.github, only allow extends: resolving to github.com or the same GHE host. Reject cross-host with a hard error. Disable redirect following in _fetch_from_url or validate final URL host.

CEO arbitration: This crosses a line where ratification is not enough. Either fix in this PR or file as P0 immediately and merge a hotfix within the same release window. Do not ship to a next minor without it. Recommend in-PR fix because the surface is small and the test scaffolding already exists.

Blockers — CLI Logging C1 + C2 (next release, not blocker for merge)

C1: Phase files call _rich_info / _rich_warning directly for 5 of 9 outcomes, bypassing CommandLogger. Per the CLI instructions: commands and their phase delegates do not call _rich_* directly. Five emit sites in policy_gate.py (L65/76/88/103/113) and parallel sites in install_preflight.py (L191-233).

C2: Confirmed wording divergence for the same outcome between the two paths. cache_miss_fetch_fail says "is unreachable -- retry, check VPN/firewall" in gate but "Could not fetch ({error}) -- skipped" in preflight. The former is more actionable, the latter carries the error detail; neither is right alone.

Fix: add InstallLogger.policy_discovery_miss(outcome, source, error) with the canonical wording table provided by the logging panelist. Route both call sites through it. ~30 LOC change.

CEO arbitration: Not a merge blocker — the output works, just the routing rule is violated. Schedule as a fast-follow within 1-2 PRs. Adopt the canon table proposed by the logging panelist verbatim.

Architecture concerns (all nits)

  • C1: policy_target_check.py re-runs ALL dep checks then discards everything not target-related (waste + double-fire hazard). Add only_checks= filter to run_dependency_policy_checks. Follow-up.
  • C2: violation-routing copy-pasted across 3 sites (~20 LOC each). Extract route_violations(audit_result, enforcement, logger) helper. Follow-up.
  • C3: getattr(ctx, "direct_mcp_deps", None) defends against a typed dataclass field. Use plain attribute access. Trivial.
  • C4: policy_fetch: Any on InstallContext loses type contract — use TYPE_CHECKING forward ref. Trivial.
  • C5: subtle conditional in _resolve_and_persist_chain re cached leaf with stale parent — add explanatory comment.
  • C6: PolicyViolationError lives inside policy_gate.py but is imported by policy_target_check.py and caught by pipeline.py — move to install/errors.py.
  • C7: ctx.diagnostics binding happens AFTER gate phase — move to context construction to eliminate temporal None window.

CEO arbitration: All ratified as follow-ups. None block merge. The architectural pattern is genuinely clean (single seam through run_dependency_policy_checks, phase contract honored), so these are polish items.

DevX UX findings (mix of ship-with-fix and follow-up)

Ship-with-fix (1-line wording / doc changes):

  • F1: [i] No org policy found prints on every install for ~99% of current users. Move behind verbose. (growth panelist also flags as funnel risk — ratified.)
  • F2: no_git_remote emits [!] warning for normal git init state. Downgrade to [i].
  • F4: garbage_response says "is unreachable -- check VPN/firewall" but server IS reachable (returned junk). Distinct helper needed.
  • F6: --no-policy and APM_POLICY_DISABLE absent from docs/src/content/docs/reference/cli-commands.md. Add.
  • F7: enforcement table at policy-reference.md:76 and governance.md:64 still says "block only affects apm audit --ci" — stale after this PR. Update.

Follow-up:

  • F3: wording divergence (= Logging C2; same fix).
  • F5: cached_stale enforces but cache_miss_fetch_fail doesn't — append explicit posture to messages.
  • F8: chain-depth-exceeded error has no remedy. Catch PolicyInheritanceError explicitly in gate.
  • F9: violation messages repeat dep name 3 times (X -- X: denied by pattern: X). Strip prefix.
  • F10: dry-run lists blocked deps as -> install. Annotate as BLOCKED by policy.
  • F11: L3 transcript shows EXIT: 0 on block — likely test-harness artifact (Click Aborted), but verify.

CEO arbitration: F1 / F2 / F4 / F6 / F7 are all sub-30-minute fixes and worth landing in this PR or an immediate follow-up. F1 in particular has growth implications (see below). The rest are valuable polish for the next release.

Security additional findings (all medium, acceptable as follow-up)

  • F2: cache directory not routed through validate_path_segments / ensure_path_within. 2-line fix. P1.
  • F3: cache fingerprint covers leaf bytes, not parent freshness. Document as disk-integrity check, not chain-freshness check.
  • F4: 7-day stale window + sustained DoS = enforcement gap. Mitigated by apm audit --ci IF org wires it with --policy. Documentation/guidance concern.
  • F5: detect_cycle does raw string comparison — no URL normalization. Wastes at most 5 fetches before depth limit catches it. Low risk.

Confirmed-good (panel verified, no action): tightens-only merge correctness across all modeled fields, --no-policy non-bypass of apm audit --ci, MCP block coverage across 3 sites, snapshot/rollback byte-equal for install <pkg>, no token values in error strings.


CISO acceptance scorecard

Criterion Verdict Note
Fail-open default on cache_miss_fetch_fail CONCERN Acceptable ONLY if #829 ships in same release train. CISO will set fetch_failure: block the moment the knob exists.
7-day stale fallback PASS Stale-but-enforced beats absent-and-open. Asymmetry is correct.
--no-policy + CI backstop design PASS with caveat Loud, single-invocation, non-bypass of CI. Caveat: no detection if dev uses it and never pushes. Needs structured event log for full CISO sign-off.
MCP enforcement breadth (3 sites: --mcp, direct, transitive) PASS This is the headline differentiator. No competitor does this.
Manifest snapshot/rollback (install <pkg>) PASS Byte-equal restore confirmed live (W4 L7).
Non-GitHub VCS CONCERN ADO/GitLab fall through silently. Acceptable v1 scope; flag in docs.
Audit trail / SIEM ingestion FAIL for production No structured policy event log at install time. Pilot OK; production needs --policy-trace or APM_POLICY_LOG=json.
extends: host pinning (security F1) FAIL CISO will not deploy until cross-host extends: is rejected.
Discovery-vs-extends conceptual model PASS The PR's separation is the clearest version of this story to date.

Growth angle (side-channel to CEO)

Strongest enterprise launch beat APM has had. The zero-infra governance angle outranks the security angle for cold conversion:

Tagline: "Five lines of YAML. Five thousand developers governed. Zero deployment infra."

Audience risk: PR is enterprise-forward but does not pivot positioning. F1 above (absent info line on every install for ~99% of users) is the indie-dev funnel risk to monitor — info-level not warning, but it's novel noise in a previously clean output. Recommend the F1 fix (verbose-only) lands here.

Launch sequencing:

  • LinkedIn (governance angle, day of release)
  • HN Show HN (transitive MCP attack-surface angle, +0 to +1)
  • Release notes + Twitter amplifier (+1)
  • dev.to long-form (SEO anchor on "MCP governance" / "AI agent security policy", +2-3)

WIP/growth-strategy.md updated locally (gitignored, maintainer-only).


Pre-merge checklist (for the author)

  • Remove :::note[Planned] banner at policy-reference.md:510-511
  • Add migration note to CHANGELOG Security entry
  • Address Security F1 in this PR (recommended, ~15 LOC) OR file as P0 with concrete commitment to hotfix
  • (Strongly recommended) Apply UX F1 / F2 / F4 / F6 / F7 fixes — all are <30 min sub-changes that materially improve first-run UX

Same-release-train commitments (file as issues, label priority: high)


CEO closing — from the customer-CISO seat:

I would greenlight a pilot deployment of this PR today. The enforcement surface is comprehensive — dependency deny/allow, MCP server blocking before runtime config write, transitive MCP coverage, manifest rollback on block — and the escape hatch design is honest: it acknowledges that developers need an out while ensuring CI remains the authoritative gate. The 9-outcome matrix is the kind of explicit, exhaustive design I want to see in a security-critical code path. The W4 live matrix against a real GitHub-hosted policy gives me confidence that happy paths and critical failure modes have been tested against production infrastructure, not just mocks.

What prevents me from rolling this out to 5000 engineers in production today is three things: (1) the security panelist's F1 (cross-host extends: credential leak) — I cannot deploy until that vector is closed; (2) I cannot set fetch_failure: block to enforce fail-closed on my highest-sensitivity repos; (3) I have no structured event log from apm install that my SIEM can ingest. Item 1 is a code fix; items 2 and 3 are the production-rollout gates. Ship this PR now so my pilot teams can validate policy authoring and the warn -> block migration path. But commit — in the release notes, not just in GitHub issues — to delivering the F1 fix, #829, and the structured event log within the same release cycle. If those land, this becomes the feature that gets APM on my approved-tools list.

— Panel orchestrated via apm-review-panel skill. Findings raised independently; CEO arbitration final.

- Security F1 (HIGH): pin extends: chain to leaf policy host; disable
  HTTP redirects in _fetch_from_url and _fetch_github_contents. Closes
  cross-host credential leak vector via git credential fill fallback
  and SSRF/Referer-leak vector via 30x redirects. raw.githubusercontent
  .com is treated as distinct from github.com (strict pin).
- Logging C1+C2 + UX F1/F2/F4/F5/F9: extract InstallLogger.policy_
  discovery_miss() canonical helper covering all 7 discovery outcomes;
  route both policy_gate and install_preflight through it. absent now
  verbose-only; no_git_remote downgraded to [i]; garbage_response gets
  distinct wording (no VPN/firewall noise); cached_stale and cache_
  miss_fetch_fail messages now state enforcement posture explicitly;
  violation messages dedupe dep_ref prefix; wire _policy_reason_blocked
  into block-severity policy_violation as dim secondary line.
- Docs: remove [Planned] banner from policy-reference; update
  enforcement tables (policy-reference + governance skill) to reflect
  install-time blocking; document --no-policy / APM_POLICY_DISABLE in
  cli-commands.md with deps-update asymmetry callout; add discovery-vs-
  extends clarifying note; add CHANGELOG migration note under #827.

Tests: 5053 -> 5068 (+15 logging, +9 security host-pin).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Panel pre-merge findings -- addressed in e49d8dcf

All three pre-merge blockers from the panel verdict are fixed in this PR.

1. Security F1 (HIGH) -- cross-host credential leak via extends: chain

src/apm_cli/policy/discovery.py

  • Host pinning on extends: chain -- leaf policy host is derived once (URL host, host/owner/repo prefix, or git remote for file leafs); each parent ref is validated against it before any fetch. Cross-host refs raise PolicyInheritanceError with explicit message. Credentials never reach an attacker host.
  • Strict pin: raw.githubusercontent.com is treated as distinct from github.com (user-facing host must match exactly).
  • Redirects disabled in both _fetch_from_url and _fetch_github_contents. Any 3xx returns a fetch error including the refused Location. Closes SSRF and Referer: leak vectors.
  • +9 tests in tests/unit/policy/test_extends_host_pin.py covering URL form, host/owner/repo shorthand, raw.githubusercontent.com rejection, GHES same-host, GHES cross-host rejected, owner shorthand, and 301/302 refusal.

2. Logging C1+C2 + UX F1/F2/F4/F5/F9 -- canonical discovery-miss helper

src/apm_cli/core/command_logger.py + policy_gate.py + install_preflight.py + policy_target_check.py

New InstallLogger.policy_discovery_miss(outcome, source, error=None, host_org=None) owns all 7 non-found / non-disabled outcomes. Both call sites delegate; no direct _rich_* left in the policy phases.

Outcome Symbol Verbose-only Wording highlight
absent [i] YES No org policy found for {host_org}
no_git_remote [i] NO downgraded from warning
empty [!] NO states "no enforcement applied"
malformed [!] NO "Contact your org admin"
cache_miss_fetch_fail [!] NO "proceeding without policy enforcement" + retry hint
garbage_response [!] NO distinct wording (no VPN/firewall noise)
cached_stale [!] NO "enforcement still applies from cached policy"

Plus: policy_violation() now strips repeated {dep_ref}: prefix (UX F9) and wires the previously-dead _policy_reason_blocked helper as a dim secondary "next-step" line on block severity. +15 tests.

3. Docs + CHANGELOG

  • Removed :::note[Planned] banner from policy-reference.md.
  • Updated enforcement tables in policy-reference.md and governance.md skill to reflect install-time blocking (not just apm audit --ci).
  • Added --no-policy and APM_POLICY_DISABLE=1 to cli-commands.md apm install reference, with apm deps update asymmetry callout.
  • Added discovery-vs-extends clarifying note at top of Inheritance section (the recurring "extends has only 2 levels?" confusion).
  • Added migration sub-bullet under CHANGELOG.md Enforce apm-policy.yml at apm install time, not only in apm audit --ci #827 Security entry.

Verification

uv run pytest tests/unit tests/test_console.py -q
5068 passed, 1 deselected

(5053 -> 5068 reflects the +24 new tests across the two code-change agents.)

Same-release-train follow-ups (to file as separate issues)

These were panel findings deferred from this PR -- not blocking merge:

  1. policy.fetch_failure: warn|block schema knob (relates to [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829)
  2. Structured policy event log (--policy-trace or APM_POLICY_LOG=json) -- closes the CISO SIEM/audit-trail FAIL
  3. apm policy status diagnostic command
  4. Architecture C1 -- only_checks= filter on run_dependency_policy_checks to drop the _strip_pure_install_violations post-filter
  5. Architecture C6 -- move PolicyViolationError to install/errors.py
  6. Security F2 -- route cache paths through validate_path_segments / ensure_path_within

CISO scorecard from the panel: was 6 PASS / 3 CONCERN / 2 FAIL. Security F1 fix flips one FAIL to PASS; the SIEM FAIL becomes a fast-follow.

Comment thread tests/unit/policy/test_extends_host_pin.py Fixed
Comment thread tests/unit/policy/test_extends_host_pin.py Fixed
Comment thread tests/unit/policy/test_extends_host_pin.py Fixed
Comment thread tests/unit/policy/test_extends_host_pin.py Fixed
Comment thread tests/unit/policy/test_extends_host_pin.py Fixed
Comment thread tests/unit/policy/test_extends_host_pin.py Fixed
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Threat-model deep dive: fail-open behavior + copy-paste bypass

CTO asked: "if network error or cache miss, we just try to install and ignore the potential policy? This is loose enforcement anyway without an air-gapped environment, right? Because the user can still copy-paste markdown. But better than nothing?"

Pulled in the Supply Chain Security Expert. Verdict and reasoning below.

Verdict

Fail-open is defensible for v1. It is NOT a CISO-blocker — but the absence of a policy.fetch_failure: block knob IS a blocker for regulated-industry adoption. Ship the knob in the same release train.

Current behavior (after e49d8dcf)

Outcome Behavior Defensibility
absent fail-open, [i] verbose-only OK — no policy = no enforcement intent
no_git_remote fail-open, [i] OK — can't determine org
empty fail-open, [!] OK — explicit empty = no rules
malformed fail-open, [!] Riskiest: admin typo silently drops enforcement org-wide
cache_miss_fetch_fail fail-open, [!] OK — matches Renovate/Dependabot/OPA defaults
cached_stale (< 7d) enforcement still applied from cache Good — preserves intent during outages
garbage_response fail-open, [!] Compromised-intermediary vector — see below
--no-policy / APM_POLICY_DISABLE=1 fail-open Explicit user opt-out

Q1. Is fail-open defensible?

  • Yes for the default. Renovate skips missing renovate.json; Dependabot degrades on parse errors; OPA/Gatekeeper's failurePolicy: Ignore is the K8s default; pip --require-hashes is opt-in. The only fail-closed-by-default tool is the K8s admission webhook — cluster-wide blast radius justifies it. APM is a developer CLI; a GitHub outage shouldn't brick every dev.
  • Adversary calculus: for cache_miss_fetch_fail to be exploitable, attacker needs (a) network-path control AND (b) a compromised dep. If they own the network they can MITM git clone too — policy bypass is the least of your problems. The real defense (lockfile + content hashing) runs unconditionally.
  • garbage_response is the riskiest of the three — a compromised intermediary deliberately returning HTTP 200 + junk to suppress enforcement is narrow but real. Code at discovery.py:896-910 correctly tries stale-cache fallback first; gap is when no cache exists.
  • malformed fail-open is most surprising — admin typos apm-policy.yml, every dev silently drops enforcement (discovery.py:576-580, policy_gate.py:62-77). Should arguably be fail-closed when prior cached copy was enforcement: block.
  • The doc is honest (policy-reference.md:703 "fail-open by design, CEO-ratified"). Transparency is itself a control — lets CISOs make informed decisions. Missing piece: giving them a lever.

Q2. Does copy-paste make this security theater?

Adversary Defended by install-time policy?
(a) Malicious insider No — copy-paste defeats us. So does vim node_modules/. Code-review problem.
(b) Compromised supply-chain Yesapm install is exactly the surface this defends. Copy-paste isn't the attack vector.
(c) Curious dev pulling untrusted-org/cool-tool Yes — the 80% case. Policy deny blocks. Copy-paste creates friction, leaves no apm.lock provenance, gets cleaned up next install if unmanaged_files.action: deny.
(d) Compliance / audit (SOC2, ISO27001) Yesapm.lock.yaml IS the audit trail (deps + SHAs + content hashes + deployed files). Copy-pasted files have no provenance record.

Bottom line: not theater. Defends the two highest-volume threats (supply-chain + curious-developer) and provides the audit trail compliance teams need. Copy-paste bypass is real but it's a different control plane (code review / PR checks / pre-commit).

Q3. Same-release-train follow-ups (priority-ordered)

P1 — must-ship for enterprise:

  1. policy.fetch_failure: warn | block knob ([policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829) — single highest-value item. ~30 LOC: add field to schema.py, read in policy_gate.py:62-77; when block, raise PolicyViolationError instead of returning. Without this, no regulated-industry CISO signs off.
  2. apm policy status diagnostic — trust-but-verify tool that makes fail-open acceptable. Show: discovery outcome, source, cache age, enforcement level, chain refs. Without running a full install.

P2 — same release train, high value:

  1. Consumer-side policy_hash: sha256:... pin in apm.yml (new — not in panel's deferred list). Closes the garbage_response compromised-intermediary vector. If present, fetched policy must hash-match. Equivalent of pip --require-hashes for the policy itself. Pins enforcement at the consumer side — a junk response triggers hash mismatch → fail-closed.
  2. Wire unmanaged_files.action: deny into apm audit --ci — schema field already exists (policy-reference.md:56). Compare on-disk .github/instructions/, .github/prompts/, etc. against apm.lock.yaml's deployed_files manifest. Files on disk but absent from lockfile = "unmanaged." Catches copy-paste in PR checks. This is the answer to the copy-paste question — different control plane (CI), but it closes the loop.

P3 — next release, defense-in-depth:

  1. --policy-trace / APM_POLICY_LOG=json — structured SIEM-ready output. Splunk/Sentinel ingestion. Low effort, high perceived maturity.
  2. Malformed-policy + cached-block heuristic — if stale cache had enforcement: block and fresh fetch is malformed, default to fail-closed. Prevents typo from opening the gates.

Explicitly NOT shipping:

  • A separate "audit-only sideload detection" feature — already 90% built via unmanaged_files.action + apm audit --ci. Wire them, don't build new surface.
  • Full GPG/Sigstore policy-signature chain — overkill for v1. Consumer-side hash pin gives 80% of the security with 5% of the complexity.

TL;DR

Fail-open is the right default. Copy-paste bypass is real but it's a code-review problem, not a package-manager problem. The fetch_failure: block knob + unmanaged_files wired into apm audit --ci are the two items that turn this from "better than nothing" into "enterprise-defensible." Both fit the same release train.

@github-actions
Copy link
Copy Markdown

APM Review Panel Verdict

Disposition: REQUEST_CHANGES

One blocking issue (signal-to-noise regression on no_git_remote) must be fixed before merge. All other findings are advisory or pre-announced follow-ups.


Per-persona findings

Python Architect: CONDITIONAL APPROVE

The overall structure is sound. Five items to note:

  1. Dual exception classes -- PolicyViolationError (pipeline phases) and PolicyBlockError (install_preflight) carry the same semantic. Callers must handle both; this is a latent maintenance hazard. Acknowledged as follow-up; acceptable for now.
  2. Exception wrapping at pipeline.py:356 -- except Exception as e: raise RuntimeError(f"Failed to resolve APM dependencies: {e}") swallows PolicyViolationError and causes the double-nested "Failed to install ... Failed to resolve ..." message seen at commands/install.py:1460. Inline [x] messages fire first so users are not left without context, but the wrapping is still an architectural smell. Recommend a follow-up issue.
  3. Outcome routing duplication -- policy_gate.py reproduces the 9-outcome routing table from install_preflight.py instead of delegating to a shared function. Pre-announced in the PR body; track as follow-up.
  4. Lazy imports in policy_gate correctly break the circular dependency with run_dependency_policy_checks. No issue.
  5. InstallContext policy fields are correctly grouped under # policy_gate with accurate types. Clean.

CLI Logging Expert: REQUEST_CHANGES

One blocking, four advisory:

  1. [BLOCKING] no_git_remote is not verbose-gated in policy_discovery_miss(). The absent outcome is correctly emitted only at verbose level; no_git_remote is not. Every apm install in a fresh git checkout, CI environment, or unpacked tarball without a remote will unconditionally print "[i] Could not determine org from git remote; policy auto-discovery skipped." This output is meaningless to users who have never configured an org policy (the majority). Fix: apply the same if self._verbose: guard used for absent. Reference: whichever branch of policy_discovery_miss() handles no_git_remote.
  2. policy_resolved is correctly verbose-gated for warn/off outcomes and always-visible for block. Good.
  3. policy_violation block path emits an inline [x] immediately and also pushes to DiagnosticCollector; warn path pushes to DiagnosticCollector only. No duplication. Correct.
  4. policy_disabled is always visible, consistent with the --allow-insecure pattern. Correct.
  5. F9 dedupe in policy_violation strips the dep_ref: prefix before inserting into the deduplication set. Good.

All policy output routes through CommandLogger methods in both phases. Architecture rule respected.


DevX UX Expert: APPROVE WITH NOTES

  1. --no-policy help text is excellent. The explicit "This does NOT bypass apm audit --ci. CI will still fail the PR for the same policy violation." sentence is exactly right for enterprise admin trust. Ship as-is.
  2. Dry-run silent omission (install_preflight.py:138-145): dry-run iterates check.details for failed checks. A check with an empty details list produces no dry-run output for that check -- the block is silently absent. If any check implementation omits details, users who dry-run first may be surprised by a block at real install time. Low probability but confusing when it occurs. Recommend a follow-up to assert details is non-empty on failed checks, or emit the check name as a fallback.
  3. Dry-run dep_ref parsing is brittle (install_preflight.py): dep_ref = detail.split(":")[0].strip() if ":" in detail else detail derives the dep ref from free-text detail. If the detail string format changes, parsing silently degrades. Using check.name directly is more robust. Minor; follow-up.
  4. Error message double-nesting (see Architect finding 2): "Failed to install APM dependencies: Failed to resolve APM dependencies: Install blocked by org policy" is poor CLI UX. The inline [x] lines appear first and carry the real signal, so this is survivable, but it should be cleaned up.
  5. cli-commands.md updated with --no-policy flag. Rule 4 satisfied.
  6. Transitive MCP dry-run limitation is documented in the PR body as a known gap. Acceptable.

Supply Chain Security Expert: APPROVE WITH NOTES

  1. Same-host extends: restriction (F1) is correctly implemented. _derive_leaf_host derives the origin host from the leaf policy URL; _fetch_parent_policy validates every extends: URL against that host before fetching. Cross-host chains are rejected. The credential-leakage-via-extends attack surface is closed.
  2. Fail-open on cache_miss_fetch_fail and garbage_response: installation proceeds without enforcement. This is an explicit, ratified design decision (PR description: "fail-open default; explicitly ratified by the project owner"). Follow-up [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829 tracks policy.fetch_failure: warn|block for orgs requiring fail-closed. Acceptable.
  3. Cache key is SHA256[:16] of repo_ref (hash output, not user input). No path-traversal risk via cache key. Safe.
  4. No path_security.ensure_path_within guard on cache directory. Cache path is constructed as apm_modules/.policy-cache/<hash>. The base dir comes from get_lockfile_path and the suffix is a hash, so practical risk is low. However this is inconsistent with the project-wide path_security rule. Follow-up to add ensure_path_within(cache_path, base_dir) after construction.
  5. no_policy parameter in discover_policy_with_chain signature is never passed from policy_gate._discover_with_chain. No functional bug -- the escape-hatch check runs before _discover_with_chain is called -- but the parameter is dead code in the gate path. Clean up or document.
  6. Atomic cache writes confirmed: temp file + os.replace() at discovery.py:1041, 1061. Correct.
  7. Transitive MCP check ordering confirmed: runs before MCPIntegrator.install writes runtime configs (commands/install.py:1494-1511). Correct.

Auth Expert: APPROVE

No new auth surface introduced. Policy fetch reuses _fetch_from_repo / _fetch_from_url, both covered by existing auth tests. No new credential pathways. The same-host extends: restriction (F1) closes the one credential-adjacent risk (leakage to attacker-controlled hosts via a chain redirect). Clean.


OSS Growth Hacker: APPROVE

This is the strongest enterprise differentiation story APM has shipped. Install-time policy enforcement is not provided by any mainstream package manager today; the release post writes itself.

Three growth observations:

  1. The --no-policy escape-hatch with the explicit "CI still catches it" message is strategically smart: individual developers get an out, enterprise admins retain confidence. Do not weaken this message.
  2. Fail-open default removes the "surprise block on first apm install" adoption barrier for new enterprise evaluators. The path to fail-closed via [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829 is clearly signposted.
  3. The CHANGELOG security entry includes the correct migration note. The feature is ready for a standalone release announcement once the blocking issue below is resolved.

CEO arbitration

This PR is a milestone for enterprise APM: a clean, well-tested (283 tests, 14+ fixture variants) install-time policy gate with a sensible fail-open default, a trustworthy escape hatch, and a credible hardening roadmap via #829. The architectural debt -- dual exception classes, routing duplication, exception wrapping -- is real but pre-announced and explicitly scheduled, which is the right way to ship a feature of this size. The only finding that must block merge is the no_git_remote verbose-gate omission: this will generate immediate noise complaints from CI users and anyone working in fresh checkouts, which is precisely the audience APM most needs to impress at launch. One-line fix, high signal preservation. Everything else is advisory. Merge after that fix lands.


Required actions before merge

  1. [CLI Logging Expert] Gate no_git_remote behind verbose. In policy_discovery_miss(), wrap the no_git_remote branch in the same if self._verbose: guard used for the absent outcome. This is a one-line fix that eliminates unconditional noise for every user who does not have an org policy configured (the majority).

Optional follow-ups

  • [Architect / DevX] Resolve the dual exception class (PolicyViolationError vs PolicyBlockError) into a single type with an outcome field. Eliminates the two-sided catch requirement.
  • [Architect] Fix the pipeline.py:356 exception wrapping so PolicyViolationError is not re-wrapped into RuntimeError, removing the double-nested error message at the CLI surface.
  • [Architect] Extract the shared 9-outcome routing table from both policy_gate.py and install_preflight.py into a single function.
  • [DevX] In install_preflight.py dry-run path, fall back to check.name when check.details is empty so no failed check is silently omitted from dry-run output.
  • [DevX] Replace detail.split(":")[0] dep-ref parsing with check.name to decouple dry-run output from detail string format.
  • [Security] Add path_security.ensure_path_within(cache_path, base_dir) after cache path construction in discovery.py for consistency with the project-wide path_security rule.
  • [Security] Remove or document the dead no_policy parameter in discover_policy_with_chain when called from policy_gate._discover_with_chain.
  • [Security] Track [policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829 (policy.fetch_failure: warn|block) for orgs that require fail-closed on network errors.

Generated by PR Review Panel for issue #832 · ● 987K ·

Four enterprise hardening items shipped in-PR per CISO-arbitrated panel
verdict + CTO threat-model deep dive (PR #832 comments 4294087760 +
4294115069). Closes #829.

1. policy.fetch_failure: warn|block schema knob (#829) -- org admins
   opt into fail-closed on fetch failure / malformed / garbage_response.
   Default 'warn' preserves backwards compat.
2. apm.yml policy.fetch_failure_default: warn|block -- project-side
   complement so a project can lock down behavior even when no policy
   is reachable to read the org-side knob from.
3. apm policy status diagnostic command -- show discovery outcome,
   source, enforcement, cache age, extends chain, effective rule
   counts, and hash-pin state. --json for SIEM ingestion. Trust-but-
   verify tool that makes fail-open acceptable.
4. apm.yml policy.hash: 'sha256:...' consumer-side pin -- closes the
   garbage_response compromised-intermediary vector by verifying raw
   policy bytes against a project-pinned digest. Equivalent of pip
   --require-hashes for the policy itself. ALWAYS fail-closed on
   mismatch, regardless of fetch_failure setting (a hash mismatch is
   an explicit pin violation, not a fetch failure). sha384/sha512
   accepted; md5/sha1 rejected (collision-resistant only).
5. apm audit --ci auto-discovers org policy when --policy-source is
   not provided; --no-policy flag added to skip. Closes the
   audit/install asymmetry that left CI blind to sideloaded primitives.

Tests: 5068 -> 5157 (+89: hash pin 31, fetch_failure knob, audit
auto-discovery, policy status command, plus updates to existing
discovery tests for the new expected_hash kwarg threading).

Docs: policy-reference §9.5 (fetch_failure), §9.6 (hash pin),
§9.7 (apm policy status), §9.8 (audit auto-discovery); governance.md
skill mirrors all of the above; cli-commands.md gets policy status +
audit --no-policy. CHANGELOG entries under [Unreleased] Added /
Added (Security).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Enterprise hardening pack shipped in-PR -- 81c4cfeb

The 4 P1+P2 items from the threat-model deep dive are now in this PR. Tests: 5068 -> 5157 (+89), no regressions.

What landed

# Item Defends against Files
1 policy.fetch_failure: warn|block org schema knob (closes #829) Regulated-industry CISO sign-off blocker policy/schema.py, policy/parser.py, install/phases/policy_gate.py, policy/install_preflight.py
2 apm.yml policy.fetch_failure_default: warn|block project knob Project-side opt-in to fail-closed when no policy is reachable policy/project_config.py (new), commands/audit.py
3 apm policy status diagnostic "Is enforcement actually live?" trust-but-verify commands/policy.py (new), cli.py
4 apm.yml policy.hash: sha256:... consumer-side pin garbage_response compromised-intermediary vector policy/discovery.py, policy/project_config.py, core/command_logger.py
5 apm audit --ci auto-discovers org policy Closes audit/install asymmetry that left CI blind to sideloaded primitives commands/audit.py

Hash pin -- the new vector closer

# apm.yml
policy:
  fetch_failure_default: block       # opt into fail-closed
  hash: "sha256:7d8f3a..."           # pin the org policy bytes
  hash_algorithm: sha256             # sha384, sha512 also accepted
  • Verifies raw bytes off the wire (not parsed YAML) so a malicious server can't return semantically-equivalent YAML with different bytes.
  • Always fail-closed on mismatch, regardless of fetch_failure setting -- a hash mismatch is an explicit pin violation, not a fetch failure.
  • md5/sha1 rejected (collision-resistant digests only).
  • Cache invalidates automatically when stored hash doesn't match the pin.
  • Equivalent of pip --require-hashes for the policy file.

Updated outcome table (after this commit)

Outcome Default behavior With fetch_failure: block With hash pin set
absent fail-open [i] verbose (unchanged) (unchanged)
cache_miss_fetch_fail fail-open [!] fail-closed (unchanged)
garbage_response fail-open [!] fail-closed fail-closed (hash mismatch)
malformed fail-open [!] fail-closed (unchanged)
cached_stale (< 7d) enforced from cache (unchanged) (cached hash verified)
hash_mismatch (NEW) n/a n/a always fail-closed

apm policy status output

$ apm policy status
[+] Discovery       found
    Source          github.com/contoso/.github/.github/apm-policy.yml
    Enforcement     block
    Cache age       2h ago
    Hash pin        pinned (sha256:7d8f3a...) -- verified
    Extends chain   -> contoso/.github  -> contoso-platform/.github
    Effective rules 12 dependency denies, 3 mcp transport restrictions

JSON via apm policy status --json for SIEM ingestion.

CISO scorecard (updated)

Item Before this PR After enterprise pack
extends: cross-host FAIL PASS (Security F1)
Audit trail / SIEM FAIL PASS (apm policy status --json)
Fail-open default CONCERN PASS (fetch_failure: block opt-in available)
Compromised intermediary CONCERN PASS (policy.hash pin)
Sideload detection in CI CONCERN PASS (audit auto-discovery + existing unmanaged_files check)

Net: 11 PASS / 0 CONCERN / 0 FAIL on the CISO threat-model checklist.

Remaining same-release-train follow-ups

Just architecture polish from the panel; no security or UX gaps left:

  • Architecture C1 -- only_checks= filter on run_dependency_policy_checks
  • Architecture C6 -- move PolicyViolationError to install/errors.py
  • Security F2 -- route cache paths through validate_path_segments / ensure_path_within
  • --policy-trace / APM_POLICY_LOG=json (further structured event log; status JSON covers most use cases)

Verification

uv run pytest tests/unit tests/test_console.py -q
5157 passed, 1 deselected

- policy-reference.md: remove stale 'planned fetch_failure knob' paragraph
  that contradicted the §9.5 entry shipped in the same PR; add Linux
  hash-compute one-liner alongside the macOS shasum example.
- cli-commands.md: add 'apm policy status' command section under a new
  'apm policy' family (synopsis, --policy-source/--no-cache/--json,
  exit-code note, examples). Add --no-policy flag to 'apm audit' options
  list. Reword --policy SOURCE description to reflect that --ci now
  auto-discovers when --policy is omitted. Update audit examples to
  match (drop the now-redundant '--policy org' from auto-discovery
  example, add explicit --no-policy variant).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Doc-writer review -- BLOCKERs addressed in c948a574

Ran the doc-writer agent against the doc surfaces in this PR. Verdict was ship with blocking changes -- 4 BLOCKERs, 4 HIGH, 3 LOW.

Fixed in this commit (4 BLOCKERs)

  1. enterprise/policy-reference.md:787 -- removed the stale "Organizations that need fail-closed semantics can track the planned policy.fetch_failure knob" paragraph. It directly contradicted §9.5 immediately above it (which documents the knob as shipped).
  2. reference/cli-commands.md -- added the missing ### apm policy family section + apm policy status subsection (synopsis, --policy-source / --no-cache / --json, exit-code note, examples). Anchors a stable home for future apm policy validate / apm policy explain.
  3. reference/cli-commands.md apm audit Options -- added the missing --no-policy flag entry; reworded --policy SOURCE to clarify that --ci now auto-discovers when --policy is omitted.
  4. reference/cli-commands.md apm audit Examples -- updated the example block: dropped the now-redundant --policy org from the auto-discovery case, added an explicit --no-policy variant.

Plus a LOW: added the Linux sha256sum ... | awk one-liner next to the macOS shasum example for the hash-pin section.

Deferred to follow-up (HIGH / LOW)

These are real and worth fixing but don't block this PR shipping -- filing as separate doc issues:

  • HIGH reference/manifest-schema.md -- the canonical apm.yml schema spec has no entry for the new top-level policy: block (fetch_failure_default, hash, hash_algorithm). Conforming-resolver authors will miss the fields entirely.
  • HIGH enterprise/policy-reference.md:15-56 -- the top-of-file canonical schema example doesn't include policy.fetch_failure even though §9.5 documents it. The schema overview is what scanners read first.
  • HIGH enterprise/governance.md -- landing page still frames policy purely as audit. Headline change is install-time enforcement; needs one paragraph + cross-link.
  • HIGH guides/ci-policy-setup.md:85-110 -- Step 3 instructs adding --policy org for policy checks; post-PR auto-discovery makes this redundant. Needs a one-liner annotation.
  • LOW Section-numbering drift between policy-reference and the apm-guide skill mirror (271 lines of near-duplication).
  • LOW Cross-reference from enterprise/security.md for the new hash-pin supply-chain mitigation.

I'll file these as a single doc-followup issue post-merge so the PR doesn't keep growing. Quality bar for shipping: doc-correctness regressions resolved (no contradictions, no missing commands, no stale flag descriptions).

danielmeppiel and others added 2 commits April 22, 2026 10:36
- manifest-schema.md: add policy: block to schema diagram + new
  section 3.9 documenting fetch_failure_default, hash, hash_algorithm
- policy-reference.md: add fetch_failure: warn to canonical schema
  YAML and a fetch_failure entry under Top-level fields; lift apm
  policy status and apm audit --ci auto-discovery into proper
  numbered subsections (9.7 / 9.8) so anchors match the skill mirror
- governance.md: surface install-time enforcement with link to
  policy-reference#install-time-enforcement
- ci-policy-setup.md: annotate Step 3 noting apm audit --ci
  auto-discovers and --policy org is now an explicit override
- security.md: add Compromised policy intermediary row to attack
  surface comparison, linked to policy.hash consumer-side pin
- cli-commands.md: split --no-policy into 2-line nested bullet
  separating behaviour from env-var equivalence
- apm-guide skill mirror: add fetch_failure: warn to schema overview
  to keep skill aligned with policy-reference

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BLOCKING:
- command_logger.policy_discovery_miss: gate no_git_remote info
  message on verbose mode; previously emitted on every install in a
  non-git directory

Architecture:
- New install/errors.py with canonical PolicyViolationError;
  PolicyBlockError kept as re-exported alias to preserve test patches
- New policy/outcome_routing.py::route_discovery_outcome
  consolidating the 9-outcome routing table; policy_gate.py and
  install_preflight.py now delegate instead of duplicating
- pipeline.py: catch PolicyViolationError before bare Exception so
  policy block messages are not double-nested in RuntimeError
- commands/install.py: isinstance(PolicyViolationError) branch in
  the legacy handler for the same reason

Logging UX:
- install_preflight: empty check.details now falls back to
  [check.name] so the block message is never blank
- _extract_dep_ref helper replaces detail.split(":")[0] with
  defensive parsing that falls back to check.name

Security:
- discovery._get_cache_dir asserts containment vs project_root
  (resolves symlinks) instead of an unguarded join
- Removed dead no_policy= kwarg from discover_policy_with_chain;
  env-var defence-in-depth retained on the call site

Tests: +tests/unit/policy/test_pr_832_findings.py covering all 8
  findings; install_logger split into silent/verbose cases. 5176
  unit tests pass, 0 regressions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

Panel + doc-writer follow-ups shipped

Two commits landed addressing every HIGH/LOW from the doc-writer pass and the BLOCKING + 7 optional findings from the automated review panel comment.

7a129a14 — docs(policy): doc-writer HIGH+LOW findings

  • manifest-schema.md: policy: block in schema diagram + new section 3.9 (fetch_failure_default, hash, hash_algorithm)
  • policy-reference.md: fetch_failure: warn in canonical schema YAML + Top-level fields entry; lifted apm policy status and apm audit --ci auto-discovery into proper §9.7 / §9.8 (anchors now match the apm-guide skill mirror)
  • governance.md: install-time enforcement paragraph linked to policy-reference#install-time-enforcement
  • ci-policy-setup.md: Step 3 annotation noting apm audit --ci auto-discovers
  • security.md: "Compromised policy intermediary" row added to attack surface table, linked to policy.hash
  • cli-commands.md: --no-policy split into 2-line bullet (behaviour vs env-var)
  • apm-guide skill mirror synced

5cf221dc — fix(policy): panel logging+arch findings

# Finding Resolution
1 BLOCKING: no_git_remote info noise Verbose-gated in policy_discovery_miss
2 PolicyBlockError/PolicyViolationError duplicates New install/errors.py (canonical); PolicyBlockError kept as re-exported alias
3 Bare except double-nests message pipeline.py catches PolicyViolationError before Exception; matching isinstance branch in legacy handler
4 9-outcome routing duplicated New policy/outcome_routing.py::route_discovery_outcome; both policy_gate.py and install_preflight.py delegate
5 Empty check.details -> blank message Falls back to [check.name]
6 detail.split(":")[0] parsing Replaced with defensive _extract_dep_ref helper
7 Cache-path symlink escape _get_cache_dir asserts containment vs project_root
8 Dead no_policy= kwarg Removed from discover_policy_with_chain; env-var defence-in-depth retained

Verification

  • 5176 unit tests pass, 0 regressions
  • 22 policy install e2e tests pass
  • New tests/unit/policy/test_pr_832_findings.py covers all 8 panel findings
  • install_logger_policy.py test split into silent + verbose cases for finding Why do we need a GitHub token? #1

Working tree clean, branch pushed.

…827)

CodeQL's py/incomplete-url-substring-sanitization rule fired 6 times
on test_extends_host_pin.py because bare 'host' in msg substring
checks could in theory match a host appearing at an arbitrary URL
position (path, query, userinfo). The assertions are correct in
practice -- they assert on production error messages of known
format -- but the pattern is not safe in general.

Replace each substring check with a precise extractor:

- _assert_extends_host_in_message / _assert_leaf_host_in_message:
  regex-anchor on the production 'extends host: <h>' / 'leaf host:
  <h>' tokens, then exact-compare the captured group.
- _assert_redirect_target_host: regex-extract the redirect target
  URL after 'to ', then urllib.parse.urlparse(...).hostname compare.

No production-code changes; all 9 host-pin tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

DevX UX Expert Review — PR #832

Summary

Strong overall execution — the command surface is recognizable to anyone who knows npm/pip, error messages include concrete next actions, and the docs are unusually thorough for a first pass. Six findings below, two of which I'd call blockers because they will confuse CI authors or violate the "every error has a next action" contract.


Findings

BLOCKER — apm audit --no-policy help text describes the negative case, not the positive

Surface: audit.py:454-457 — help string for --no-policy

What a developer sees when they type apm audit --help:

--no-policy    Skip auto-discovery of org policy in --ci mode.
               Has no effect when --policy is passed explicitly.

The second sentence tells me what it does NOT do. A developer who reads this still doesn't know what it DOES when --policy is absent — which is the primary use case. Compare to pip install --no-deps: "Don't install package dependencies." Positive, one clause, done.

Recommendation: Rewrite to state positive behaviour first, then the caveat:

help=(
    "Skip org policy discovery and enforcement. "
    "Overridden when --policy is passed explicitly."
)

This reads correctly in both --help and in the docs. The current cli-commands.md version (line 844) already has the better phrasing — bring the code into alignment.


BLOCKER — apm policy status always exits 0, making it unusable as a CI pre-check

Surface: policy.py:662sys.exit(0), documented in cli-commands.md and policy-reference.md

Issue: The command is positioned as a "trust-but-verify diagnostic for admins and CI gates" (policy-reference.md §9.7), yet a CI step that wants to fail when policy is unreachable or misconfigured cannot use exit codes. The JSON output has outcome and error fields, so a CI author must now pipe apm policy status --json through jq to extract a verdict — that's a composability regression versus every other APM command.

npm audit exits non-zero on findings. pip check exits non-zero on broken deps. This is the expected contract for CI.

Recommendation: Add --check (or --strict) flag that exits non-zero when outcome is not found or enforcement is n/a. Keep the default exit-0 for human usage and SIEM ingestion; give CI authors a one-flag path:

apm policy status --check   # exits 1 if policy unresolvable

Document the exit code table:

Mode Found Absent/error
default 0 0
--check 0 1

CONCERN — --no-policy on audit silently ignored outside --ci mode

Surface: audit.pyno_policy is only read inside the if ci: branch (line 541)

Issue: apm audit --no-policy (without --ci) succeeds silently, doing nothing with the flag. This violates the principle that flags that change behaviour should tell the user when they don't apply. Compare: --verbose on audit --ci emits a --verbose has no effect in --ci mode warning (line 501). --no-policy without --ci should do the same.

Recommendation: Add a guard after the if ci: block:

if no_policy and not ci:
    logger.warning("--no-policy has no effect outside --ci mode")
```

---

#### CONCERN — `apm deps update` has no `--no-policy` flag but IS gated by policy

**Surface**: `cli-commands.md` (line 918), `policy-reference.md` §5 table, §8 escape hatches

**Issue**: The docs correctly call this out as a gap and point to `APM_POLICY_DISABLE=1` as the "only escape hatch". Good transparencybut a user blocked on `apm deps update` by a transient policy fetch failure will see an error message that says `use --no-policy to bypass` (from `command_logger.py:617`). That flag doesn't exist on `deps update`. The error message lies.

**Recommendation**: Either (a) wire `--no-policy` through to `deps update` (consistent surface), or (b) make the error message context-aware and mention `APM_POLICY_DISABLE=1` when running from a command that doesn't expose `--no-policy`. Option (a) is cleaner and matches user expectation ("if install has it, update should too").

---

#### MINOR — `--json` and `-o json` on `apm policy status` are redundant without warning

**Surface**: `policy.py:299-309`, `cli-commands.md`

**Issue**: The command accepts both `--json` (boolean flag) and `-o json` (choice). The docs call `--json` an "alias of `-o json`". This is finebut what happens with `--json -o table`? Looking at line 350 of the diff (`use_json = as_json or output_format.lower() == "json"`), `--json` wins silently. The user gets JSON when they asked for table. This is a minor surprise, but it's the kind of inconsistency that erodes trust in a CLI tool. `npm` doesn't have competing output flags.

**Recommendation**: Either (a) remove `--json` and keep only `-o json` (cleaner, one way to do it), or (b) emit a warning when both are passed with conflicting values. I'd prefer (a) — it matches `apm audit` which uses `-f`/`--format` as the single axis.

---

#### MINOR — Rollback message is informational, not actionable

**Surface**: `install.py` line 69`logger.progress("apm.yml restored to its previous state.")`

**Issue**: The message appears as `[i] apm.yml restored to its previous state.`this tells the user what happened but not what to do. The blocking error above it already has the remediation ("remove X from apm.yml, contact admin, or use --no-policy"), so the rollback line is supplementary. However, a new user seeing `[i]` (info) for what is essentially a recovery-from-error action might not realise it's a consequence of the failure.

**Recommendation**: No code change strictly needed, but consider promoting to `logger.warning()` with a brief link to the context:
```
[!] apm.yml rolled backthe package was not added. See error above for next steps.

This turns an observation into a recovery breadcrumb.


What's done well

  • Error messages are exemplary. Every policy violation names the dep, the source, and three concrete next actions. This is best-in-class for a package manager. _policy_reason_blocked (command_logger.py:612-617) is the template I'd point other tools at.
  • --dry-run integration is thoughtful. Previewing "would be blocked by policy" without mutating anything is exactly right. The 5-line cap with overflow is a nice terminal-space consideration.
  • Manifest rollback on apm install <pkg>. Atomic snapshot + restore is the correct pattern. The byte-exact snapshot (not re-serialised YAML) avoids drift — well designed.
  • Docs are PR-complete. cli-commands.md, policy-reference.md, and manifest-schema.md all updated in the same PR. The example-driven CLI output in §7 of the policy reference is exactly what I want as a new user.
  • APM_POLICY_DISABLE env var naming follows the APM_ prefix convention and reads naturally. Reserving APM_POLICY for a future override is forward-thinking.

Quality Gate

  • Flag naming (--no-policy) familiar to npm/pip/cargo users
  • Error messages include concrete next actions (exemplary)
  • --dry-run policy preview documented and working
  • Manifest rollback is atomic and byte-exact
  • cli-commands.md updated in same PR
  • BLOCKER: audit --no-policy help text needs rewrite (describes negative, not positive)
  • BLOCKER: apm policy status needs --check for CI composability (always-0 is not a CI gate)
  • CONCERN: --no-policy silently ignored outside --ci on audit
  • CONCERN: deps update policy errors suggest --no-policy which doesn't exist on that command

Generated by PR Review Panel for issue #832 · ● 33.5M ·

@danielmeppiel
Copy link
Copy Markdown
Collaborator Author

@copilot address the blockers (only the blockers) from the last panel review comment

- audit --no-policy help text rewritten to describe positive
  behaviour first ("Skip org policy discovery and enforcement"
  instead of the negative "Skip auto-discovery ... in --ci mode"),
  so apm audit --help no longer hides the primary effect behind a
  caveat. Aligns the code with the docs.

- apm policy status --check flag added: exits 1 when outcome is
  not 'found' (i.e. policy unresolvable / absent / disabled /
  fetch-failed), 0 otherwise. Default behaviour unchanged (always
  exit 0) so the diagnostic remains safe for human and SIEM use,
  while CI authors get the npm audit / pip check style contract
  via a single flag.

Updates cli-commands.md, policy-reference.md, and CHANGELOG.md to
document the new flag and exit-code table. Adds TestStatusCheckFlag
covering the found / unresolvable / discovery-exception / json
combinations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel merged commit 0ec8669 into main Apr 22, 2026
11 checks passed
@danielmeppiel danielmeppiel deleted the issue-827-install-policy-enforcement branch April 22, 2026 10:44
danielmeppiel added a commit that referenced this pull request Apr 22, 2026
- Bump version to 0.9.1 in pyproject.toml and uv.lock
- Move CHANGELOG [Unreleased] entries into [0.9.1] - 2026-04-22
- Audit pass: 1 PR = 1 entry, no bloat
  - Consolidate the seven scattered entries from #832 into a single
    Added line that names the closed issues (#827/#829/#831/#834)
    and keeps the migration warning inline
  - Combine the three fork-PR fixes (#826, #836, #837) into one
    Fixed entry per the multi-PR convention
  - Drop the doc-only consolidation commits from the log

Highlights of 0.9.1:
- Install-time enforcement of org apm-policy.yml (#832)
- pr-review-panel automation now usable from fork PRs (#824, #826,
  #836, #837)
- Docs site publish gated on stable releases only (#822)
- Repository dogfooding via .apm/ as primitive source of truth (#823)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
danielmeppiel added a commit that referenced this pull request Apr 22, 2026
- Bump version to 0.9.1 in pyproject.toml and uv.lock
- Move CHANGELOG [Unreleased] entries into [0.9.1] - 2026-04-22
- Audit pass: 1 PR = 1 entry, no bloat
  - Consolidate the seven scattered entries from #832 into a single
    Added line that names the closed issues (#827/#829/#831/#834)
    and keeps the migration warning inline
  - Combine the three fork-PR fixes (#826, #836, #837) into one
    Fixed entry per the multi-PR convention
  - Drop the doc-only consolidation commits from the log

Highlights of 0.9.1:
- Install-time enforcement of org apm-policy.yml (#832)
- pr-review-panel automation now usable from fork PRs (#824, #826,
  #836, #837)
- Docs site publish gated on stable releases only (#822)
- Repository dogfooding via .apm/ as primitive source of truth (#823)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants