Skip to content

[policy] Add policy.fetch_failure: warn|block schema knob for fail-closed enforcement (follow-up to #827) #829

@danielmeppiel

Description

@danielmeppiel

Context

In #827 we shipped install-time policy enforcement for APM. When the policy fetch fails — cache miss combined with network failure, malformed YAML response, or garbage HTTP response — the current behavior is fail-open: the install proceeds and a warning is logged.

The three PolicyFetchResult outcomes that currently fail-open are:

  • cache_miss_fetch_fail — no cached policy and the remote fetch failed (network error, timeout, DNS).
  • malformed — a response was received but could not be parsed as valid policy YAML.
  • garbage_response — the HTTP response was non-YAML (HTML error page, empty body, etc.).

This was a deliberate v1 design decision to avoid bricking developer environments when an org policy server is flaky or unreachable. For enterprises that require fail-closed behavior (no install proceeds without a verified policy verdict), we need a configurable knob.

Proposal

Add a fetch_failure key under the policy section of apm-policy.yml:

policy:
  fetch_failure: warn  # default — current behavior (fail-open)
  # fetch_failure: block  # fail-closed — install exits non-zero on fetch failure
Value Behavior
warn (default) Log a warning and proceed with install. Current behavior, unchanged.
block Exit non-zero with a clear error message. No packages are installed.

Important: cache_stale (a cached policy within MAX_STALE_TTL) is unaffected by this knob — a stale-but-valid cache is always used regardless of the fetch_failure setting. This knob only governs the three hard-failure outcomes listed above.

Acceptance criteria

  • Schema validation accepts both warn and block as valid values for policy.fetch_failure; rejects anything else.
  • Default value is warn — no behavioral change for existing users.
  • When set to block, any of the three failure outcomes (cache_miss_fetch_fail, malformed, garbage_response) causes install to exit non-zero with a clear, actionable error message (e.g., "Policy fetch failed and fetch_failure is set to block. Cannot proceed.").
  • cache_stale (within MAX_STALE_TTL) is unaffected — always uses the cached policy regardless of this setting.
  • Documentation page updated to describe the knob, default, and enterprise use case.
  • Unit tests cover both warn and block values across all three failure outcomes.
  • Integration test confirms block mode exits non-zero and warn mode logs warning but exits zero.

Open questions

  1. apm install --dry-run — Should block mode also prevent dry-run from completing? Or should dry-run always succeed (since it does not actually install anything)?
  2. --no-policy bypass — Should --no-policy still bypass enforcement even in block mode? CEO recommendation: yes. The escape hatch is the escape hatch — if an engineer explicitly opts out with --no-policy, that intent should be respected even in strict mode. The audit trail (CLI logs) already records the bypass.

Out of scope

  • Non-GitHub VCS policy hosting (e.g., GitLab, Bitbucket policy sources) — separate feature.
  • Signing or integrity verification of the policy file itself — separate security issue.
  • Granular per-rule failure modes (e.g., some rules fail-open, others fail-closed) — future iteration if demand materializes.

Follow-up to #827. Filed per CEO mandate during panel review.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementDeprecated: use type/feature. Kept for issue history; will be removed in milestone 0.10.0.good first issueGood for newcomerspolicyDeprecated: use area/audit-policy. Kept for issue history; will be removed in milestone 0.10.0.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions