fix(install): honour REQUESTS_CA_BUNDLE in package validator#911
fix(install): honour REQUESTS_CA_BUNDLE in package validator#911danielmeppiel merged 6 commits intomicrosoft:mainfrom
Conversation
The validator probed the GitHub API via stdlib urllib, which only honours SSL_CERT_FILE. Behind a TLS-intercepting proxy this surfaced as a misleading "package not accessible" error, sending users down the PAT troubleshooting path instead of the CA-trust one. Switch both _check_repo helpers to requests.get, catch SSLError explicitly, log a CA-trust hint to --verbose, and skip the auth-error context render on TLS failures.
@microsoft-github-policy-service agree |
| # requests.exceptions.SSLError is the canonical type once we stop | ||
| # using urllib; check by name to avoid importing requests at module | ||
| # import time. | ||
| if type(cur).__name__ == "SSLError": |
There was a problem hiding this comment.
The TLS classification checks type(cur).__name__ == "SSLError", which can misclassify unrelated exception types that happen to share that class name (leading to skipping auth error context incorrectly). Since this module already imports requests, use a precise check (e.g., isinstance(cur, requests.exceptions.SSLError)) and update/remove the comment about avoiding a requests import at module import time (it’s no longer accurate here).
| # requests.exceptions.SSLError is the canonical type once we stop | |
| # using urllib; check by name to avoid importing requests at module | |
| # import time. | |
| if type(cur).__name__ == "SSLError": | |
| # requests.exceptions.SSLError is the canonical TLS verification error | |
| # type for requests-based probes. | |
| if isinstance(cur, requests.exceptions.SSLError): |
| with patch( | ||
| "requests.get", | ||
| side_effect=requests.exceptions.SSLError("CERTIFICATE_VERIFY_FAILED"), | ||
| ): |
There was a problem hiding this comment.
Patching "requests.get" works but is less robust than patching the call site used by the code under test. Prefer patching apm_cli.install.validation.requests.get so the test remains correct even if validation.py changes how it imports/aliases requests.
| def test_validation_module_does_not_import_urllib_request_urlopen(self): | ||
| from pathlib import Path | ||
|
|
||
| src = Path(validation.__file__).read_text(encoding="utf-8") | ||
| # Forbid the specific call form; importing urllib for other reasons | ||
| # remains acceptable. | ||
| assert "urllib.request.urlopen" not in src, ( | ||
| "install/validation.py must use 'requests' for HTTP probes so it " | ||
| "honours REQUESTS_CA_BUNDLE the same way the rest of the codebase " | ||
| "does. Replace urllib.request.urlopen with requests.get." | ||
| ) |
There was a problem hiding this comment.
This regression guard is brittle because it relies on substring matching in source text (it can miss equivalent forms like from urllib.request import urlopen, and can false-trigger if mentioned in a comment or docstring). Consider replacing it with an AST-based check for urllib.request.urlopen usage, or a runtime behavior assertion (e.g., patch urllib.request.urlopen to fail and exercise _validate_package_exists to ensure it’s never called).
| def test_validation_module_does_not_import_urllib_request_urlopen(self): | |
| from pathlib import Path | |
| src = Path(validation.__file__).read_text(encoding="utf-8") | |
| # Forbid the specific call form; importing urllib for other reasons | |
| # remains acceptable. | |
| assert "urllib.request.urlopen" not in src, ( | |
| "install/validation.py must use 'requests' for HTTP probes so it " | |
| "honours REQUESTS_CA_BUNDLE the same way the rest of the codebase " | |
| "does. Replace urllib.request.urlopen with requests.get." | |
| ) | |
| def test_validation_module_does_not_call_urllib_request_urlopen(self): | |
| resolver = MagicMock() | |
| fake_resp = MagicMock(ok=True, status_code=200, reason="OK") | |
| with patch( | |
| "urllib.request.urlopen", | |
| side_effect=AssertionError( | |
| "install/validation.py must use 'requests' for HTTP probes so it " | |
| "honours REQUESTS_CA_BUNDLE the same way the rest of the codebase " | |
| "does. Replace urllib.request.urlopen with requests.get." | |
| ), | |
| ) as urlopen_mock, patch("requests.get", return_value=fake_resp): | |
| result = validation._validate_package_exists( | |
| "octocat/hello-world", verbose=False, auth_resolver=resolver | |
| ) | |
| assert result is True | |
| urlopen_mock.assert_not_called() |
| verbose_log( | ||
| " hint: the CA bundle does not trust the certificate presented " | ||
| "(common behind a corporate TLS-intercepting proxy). " | ||
| "Set REQUESTS_CA_BUNDLE to your organisation's CA bundle and retry." |
There was a problem hiding this comment.
The verbose hint only mentions REQUESTS_CA_BUNDLE, but the PR description also calls out SSL_CERT_FILE as a common alternative. Consider mentioning both environment variables in the hint to reduce user troubleshooting time in environments that standardize on SSL_CERT_FILE.
| "Set REQUESTS_CA_BUNDLE to your organisation's CA bundle and retry." | |
| "Set REQUESTS_CA_BUNDLE or SSL_CERT_FILE to your organisation's " | |
| "CA bundle and retry." |
APM Review Panel VerdictDisposition: APPROVE (with two minor pre-merge fixes) Per-persona findingsPython Architect: This is a routine bug-fix PR (no class hierarchy changes). The module is purely procedural -- no OO structure is added or changed; two new module-level helpers ( 1. OO / class diagramThe PR is purely procedural -- no class changes anywhere in scope. Diagram shows module boundaries and function entry points, annotated with patterns where they apply. classDiagram
direction LR
class validation_module {
<<IOBoundary>>
+_validate_package_exists(package, verbose, auth_resolver, logger) bool
-_check_repo(token, git_env) bool
-_check_repo_fallback(token, git_env) bool
}
class tls_helpers {
<<Pure>>
+_is_tls_failure(exc) bool
+_log_tls_failure(host_display, exc, verbose_log) None
}
class AuthResolver {
<<Strategy>>
+classify_host(host) HostInfo
+try_with_fallback(host, op, ...) any
+build_error_context(host, ...) str
}
class HostInfo {
<<ValueObject>>
+api_base str
+display_name str
}
validation_module ..> tls_helpers : uses
validation_module ..> AuthResolver : delegates auth
AuthResolver ..> HostInfo : returns
note for tls_helpers "New in this PR: classifies SSLError vs auth\nfailures so users get the right hint"
2. Execution flow diagramflowchart TD
A["_validate_package_exists(package, ...)"] --> B["[NET] auth_resolver.classify_host(host)"]
B --> C["[NET] auth_resolver.try_with_fallback(host, _check_repo, unauth_first=True)"]
C --> D["[NET] requests.get(api_url, headers, timeout=15)"]
D --> E{Response?}
E -->|SSLError| F["_log_tls_failure(host_display, exc, verbose_log)\nVERBOSE ONLY"]
F --> G["raise RuntimeError('TLS verification failed ...')"]
G --> H["outer: except Exception as exc:"]
H --> I{"_is_tls_failure(exc)?"}
I -->|True| J["return False -- skips build_error_context"]
I -->|False| K["auth_resolver.build_error_context(...)\nlog auth hint"]
K --> L["return False"]
E -->|resp.ok == True| M["return True"]
E -->|404 + token| N["raise RuntimeError('API returned 404')"]
N --> H
E -->|other 4xx/5xx| O["raise RuntimeError('API returned N: reason')"]
O --> H
3. Design patternsDesign patterns
One required fix (Architect): CLI Logging Expert: Output paths are correct. DevX UX Expert: Before: proxy users got "package not accessible" -> PAT troubleshooting rabbit hole. After: with No CLI command surface changes; no Supply Chain Security Expert: Net positive. Auth Expert: Activated -- Auth flow is fully preserved. Minor observation (not a blocker): if OSS Growth Hacker: Corporate proxy compatibility is a silent enterprise adoption gate -- teams behind TLS-intercepting proxies hitting "package not accessible" errors churn silently without filing issues. This fix unblocks a real enterprise segment. Story angle: "APM now works out-of-the-box in corporate environments with TLS inspection." Missing: no Side-channel to CEO: A "Corporate proxy / REQUESTS_CA_BUNDLE" troubleshooting doc entry would capture enterprise search traffic and convert frustrated evaluators. Low effort, high leverage for the enterprise adoption story. CEO arbitrationThe change is well-scoped, well-tested, and materially improves APM's story for enterprise users. All specialists agree on correctness and security posture. Two items are non-negotiable before merge: the missing CHANGELOG entry (repo convention, every code-change PR must have one) and the dead comment (misleads future contributors about why the type-name check exists). The DevX UX gap -- CA hint only under Disposition ratified: APPROVE with two minor pre-merge fixes. Required actions before merge
Optional follow-ups
|
danielmeppiel
left a comment
There was a problem hiding this comment.
Required actions before merge
- Add a CHANGELOG.md entry under ## [Unreleased] > Fixed. Suggested line: "Fixed TLS validation failure behind corporate TLS-intercepting proxies: install/validation.py now uses requests (honouring REQUESTS_CA_BUNDLE) instead of stdlib urllib, and logs a CA-trust hint under --verbose instead of a misleading auth error (fix(install): honour REQUESTS_CA_BUNDLE in package validator #911)."
- Fix stale comment in _is_tls_failure (src/apm_cli/install/validation.py, the type(cur).name == "SSLError" branch): remove or update the "check by name to avoid importing requests at module import time" comment -- requests is now a top-level import, so this justification is no longer accurate. Replacing the name check with isinstance(cur, requests.exceptions.SSLError) is the cleanest fix.
- Consider emitting the CA-trust hint at default verbosity (not just --verbose) when a TLS failure is detected -- TLS failures are rare, always actionable, and the one-liner "Set REQUESTS_CA_BUNDLE ..." would save the most common class of confused enterprise users from having to re-run with --verbose.
- Add a troubleshooting docs entry ("APM behind a corporate proxy / TLS-intercepting proxy") in docs/src/content/docs/ covering REQUESTS_CA_BUNDLE setup. Low-effort, high-leverage for enterprise adoption and search discoverability.
Audit auth_resolver.try_with_fallback retry logic: confirm whether a RuntimeError from _check_repo triggers a retry attempt (which would cause duplicate _log_tls_failure verbose output).
isinstance(SSLError) check, dedup TLS hint across retry chain, surface CA-trust hint at default verbosity, drop dead fallback, tighten verbose output, add troubleshooting docs page and CHANGELOG entry. Assisted-by: Claude:claude-sonnet-4-6
|
@danielmeppiel I have updated the change log, fixed stale comment, added two hints, one verbose and one liner on TLS failure, but confirmed that retries will still be done, but hints will appear once, added a docs section on troubleshooting and moved the TLS validation error so it. |
APM Review Panel VerdictDisposition: APPROVE (one optional follow-up noted below) Per-persona findingsPython Architect: This is a routine bug-fix PR. No classes are added or modified anywhere in scope -- 1. OO / class diagramclassDiagram
direction LR
class validation {
<<Module>>
<<IOBoundary>>
+_validate_package_exists(package, verbose, auth_resolver, logger) bool
+_is_tls_failure(exc) bool
+_log_tls_failure(host_display, exc, verbose_log, logger) None
}
class AuthResolver {
<<Strategy>>
+try_with_fallback(host, op, ...) Any
+classify_host(host) HostInfo
+build_error_context(...) str
}
class HostInfo {
<<ValueObject>>
+api_base str
+display_name str
+kind str
}
class requests_lib {
<<ExternalLibrary>>
+get(url, headers, timeout) Response
exceptions_SSLError
exceptions_RequestException
}
validation ..> AuthResolver : inject
validation ..> requests_lib : uses
AuthResolver ..> HostInfo : returns
note for validation "touched: _is_tls_failure + _log_tls_failure added\n_check_repo + _check_repo_fallback migrated to requests"
class validation:::touched
classDef touched fill:#fff3b0,stroke:#d47600
2. Execution flow diagramflowchart TD
A["apm install\n_validate_package_exists()"] --> B{"URL parse\nsucceeds?"}
B -- yes --> C["auth_resolver.try_with_fallback\n(host, _check_repo, unauth_first=True)"]
B -- no --> D["auth_resolver.try_with_fallback\n(host, _check_repo_fallback, unauth_first=True)"]
C --> E["[NET] requests.get(api_url, headers, timeout=15)"]
D --> F["[NET] requests.get(api_url, headers, timeout=15)"]
E -- SSLError --> G["raise RuntimeError\n('TLS verification failed for host')"]
F -- SSLError --> G
E -- ok 200 --> H["return True"]
F -- ok 200 --> H
E -- 404+token --> I["raise RuntimeError(status_code)"]
F -- 4xx --> I
G --> J{"outer except\n_is_tls_failure(exc)?"}
I --> J
J -- YES --> K["_log_tls_failure()\nlogger.warning: CA-trust hint (always-on)\nverbose_log: host + exc text (if verbose)"]
K --> L["return False"]
J -- NO --> M["build_error_context / verbose log\nreturn False"]
Design patterns
Architecture assessment: clean. The bounded chain walk (max 8) and self-referential guard in CLI Logging Expert: Correct use of the output architecture throughout.
One nuance worth noting: DevX UX Expert: Significant UX improvement for enterprise users. The previous behavior pushed users into the auth troubleshooting funnel for a problem that had nothing to do with credentials -- a classic "misleading failure mode" anti-pattern. The fix addresses this correctly:
The "Not behind a proxy or firewall?" section in the docs page correctly redirects users to Supply Chain Security Expert: No security regressions; net improvement.
Auth Expert: AuthResolver flow is fully preserved.
OSS Growth Hacker: High-signal enterprise adoption unlock. Corporate-proxy TLS failures are the #1 silent killer for "APM doesn't work at my company" moments. Zscaler, Netskope, and Palo Alto collectively cover the majority of enterprise network environments. This fix removes a first-day blocker with no user-visible config changes required (beyond setting one env var they likely already know from pip). Specific conversion wins:
Side-channel to CEO: this is worth a dedicated callout in the release note for the next minor -- "APM now works out of the box behind corporate TLS-inspecting proxies (Zscaler, Netskope, Palo Alto). Set CEO arbitrationAll five specialists plus the Auth Expert reviewed this PR and reached the same conclusion independently: the change is correct, well-scoped, and a net improvement on every axis. There are no disagreements to arbitrate. The Growth Hacker's side-channel is noted -- the enterprise adoption narrative is real and the next release note should lead with it. The CHANGELOG entry already has the right framing; no change needed before merge. This is a textbook targeted bug fix: narrow diff, comprehensive tests (7 test cases covering detection, default verbosity, verbose detail, single-emission guarantee, auth-context bypass, 404 regression, and 200 happy path), matching docs, and a clean CHANGELOG entry. APPROVE. Required actions before merge
Optional follow-ups
|
* chore(release): cut 0.9.4 CHANGELOG entry for 0.9.4 covers all 7 PRs merged since v0.9.3: - #974 SKILL_BUNDLE day-0 install parity (Added) - #954 automate apm-triage-panel workflow (Added) - #970 python-architect mermaid classDiagram trap (Changed) - #911 REQUESTS_CA_BUNDLE TLS validation (Fixed) - #971 triage-panel project-sync dispatch (Fixed) - #910 CLI consistency cleanup (Fixed) - #958 issue templates label taxonomy (Fixed) - #953 docs auto-deploy after bot-cut releases (Fixed) Open milestone 0.9.4 issues (41) reassigned to 0.9.5. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): tighten 0.9.4 entries (so-what for developers) Refactor per Keep-a-Changelog spirit: lead with developer impact, trim agent-internals prose, group maintainer-only changes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * chore(changelog): add #660 install.sh air-gapped entry to 0.9.4 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Behind corporate TLS-intercepting proxies, users often set
REQUESTS_CA_BUNDLE(orSSL_CERT_FILE) to point at their org's CA. The rest of APM usesrequestsand honoursREQUESTS_CA_BUNDLEcorrectly -- but the package validator was the one outlier: it used stdliburllib, which ignores it. Result: validation failed with a misleading "package not accessible" error, sending users down the PAT troubleshooting path instead of fixing CA trust.Fix: replace
urllibwithrequestsat this single point so the whole CLI has unified HTTP usage andREQUESTS_CA_BUNDLEis honoured everywhere. OnSSLError, log a CA-trust hint under--verboseand skip the auth-error context render.Fixes # (issue)
Type of change
Testing