feat: CDN-first download for unauthenticated github.com virtual files#322
feat: CDN-first download for unauthenticated github.com virtual files#322
Conversation
…h failure GitHub returns 403 with X-RateLimit-Remaining: 0 when the primary rate limit is exhausted. Previously this was treated as an authentication failure, showing a misleading "private repository" error. Now: - _resilient_get retries 403 responses that carry rate-limit exhaustion headers (X-RateLimit-Remaining: 0), using X-RateLimit-Reset for wait - _download_github_file detects rate limiting before falling through to the auth-failure error path, surfacing actionable guidance - New unit tests cover all rate-limit detection paths Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com>
…ercontent.com Use raw.githubusercontent.com (CDN, no rate limit) for unauthenticated github.com file downloads before falling back to the Contents API. - Add build_raw_content_url() helper to github_host.py - Add _try_raw_download() method for best-effort CDN fetch - Modify _download_github_file() to try CDN first when no token is set - CDN path handles main/master branch fallback - Authenticated requests and enterprise hosts still use Contents API directly - Add 10 new tests for CDN download path + 2 tests for URL builder Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com>
…e review Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com>
| assert result == b'# Agent content' | ||
| # Should have hit raw.githubusercontent.com, not API | ||
| call_url = mock_get.call_args[0][0] | ||
| assert 'raw.githubusercontent.com' in call_url |
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High test
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 17 hours ago
In general, to avoid incomplete URL substring sanitization, the code should parse the URL and inspect its components (typically the hostname) instead of checking for a substring within the entire URL string. This avoids false matches where the trusted host appears in the path, query, or as part of another hostname.
For this specific test in tests/test_github_downloader.py, we should stop using 'raw.githubusercontent.com' in call_url and 'api.github.com' not in call_url, and instead parse call_url with urllib.parse.urlparse, then assert on parsed.hostname. This maintains the existing functional intent—verifying which host was used—while avoiding substring matching on the full URL string. The file already imports urlparse on line 10, so no new imports are needed.
Concretely:
- In
TestRawContentCDNDownload.test_raw_cdn_used_for_github_com_without_token, replace:with:call_url = mock_get.call_args[0][0] assert 'raw.githubusercontent.com' in call_url assert 'api.github.com' not in call_url
call_url = mock_get.call_args[0][0] parsed = urlparse(call_url) assert parsed.hostname == 'raw.githubusercontent.com'
- In
TestRawContentCDNDownload.test_raw_cdn_not_used_when_token_present, replace:with:call_url = mock_get.call_args[0][0] assert 'api.github.com' in call_url
call_url = mock_get.call_args[0][0] parsed = urlparse(call_url) assert parsed.hostname == 'api.github.com'
No other changes or new helpers are required.
| @@ -1291,8 +1291,8 @@ | ||
| assert result == b'# Agent content' | ||
| # Should have hit raw.githubusercontent.com, not API | ||
| call_url = mock_get.call_args[0][0] | ||
| assert 'raw.githubusercontent.com' in call_url | ||
| assert 'api.github.com' not in call_url | ||
| parsed = urlparse(call_url) | ||
| assert parsed.hostname == 'raw.githubusercontent.com' | ||
|
|
||
| def test_raw_cdn_not_used_when_token_present(self): | ||
| """Authenticated requests should go straight to Contents API.""" | ||
| @@ -1314,7 +1314,8 @@ | ||
| assert result == b'# Agent content' | ||
| # Should use API with auth, not raw CDN | ||
| call_url = mock_get.call_args[0][0] | ||
| assert 'api.github.com' in call_url | ||
| parsed = urlparse(call_url) | ||
| assert parsed.hostname == 'api.github.com' | ||
|
|
||
| def test_raw_cdn_not_used_for_enterprise_host(self): | ||
| """Enterprise hosts should use API directly (no raw.githubusercontent.com).""" |
| assert result == b'# Agent content' | ||
| # Should use API with auth, not raw CDN | ||
| call_url = mock_get.call_args[0][0] | ||
| assert 'api.github.com' in call_url |
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High test
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 17 hours ago
In general, to avoid incomplete URL substring sanitization, the code should parse the URL and compare the parsed hostname (and optionally scheme and path) instead of checking for string containment. For this specific test, we can keep the semantics—verifying that the request targets the GitHub API—while using urllib.parse.urlparse to assert that the hostname is api.github.com, rather than checking that 'api.github.com' appears somewhere in the full URL string.
Concretely, in tests/test_github_downloader.py, within TestGitHubPackageDownloaderIntegration.test_raw_cdn_not_used_when_token_present, replace:
call_url = mock_get.call_args[0][0]
assert 'api.github.com' in call_urlwith code that parses the URL and inspects its hostname:
call_url = mock_get.call_args[0][0]
parsed_url = urlparse(call_url)
assert parsed_url.hostname == 'api.github.com'The file already imports urlparse at the top (from urllib.parse import urlparse), so no new imports are required. This change does not alter the intended behavior of the test; it simply makes the assertion more precise and immune to the substring issue.
| @@ -1314,7 +1314,8 @@ | ||
| assert result == b'# Agent content' | ||
| # Should use API with auth, not raw CDN | ||
| call_url = mock_get.call_args[0][0] | ||
| assert 'api.github.com' in call_url | ||
| parsed_url = urlparse(call_url) | ||
| assert parsed_url.hostname == 'api.github.com' | ||
|
|
||
| def test_raw_cdn_not_used_for_enterprise_host(self): | ||
| """Enterprise hosts should use API directly (no raw.githubusercontent.com).""" |
| assert result == b'# Agent content' | ||
| call_url = mock_get.call_args[0][0] | ||
| assert 'raw.githubusercontent.com' not in call_url | ||
| assert 'github.mycompany.com' in call_url |
Check failure
Code scanning / CodeQL
Incomplete URL substring sanitization High test
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 17 hours ago
In general, the fix is to avoid checking for hostnames using substring operations on the entire URL string, and instead parse the URL and inspect its hostname (or related components) explicitly. For tests, this means parsing the URL the code under test generated, and asserting on parsed.hostname rather than using "host" in url_string.
For this specific test in tests/test_github_downloader.py, we should replace:
call_url = mock_get.call_args[0][0]
assert 'raw.githubusercontent.com' not in call_url
assert 'github.mycompany.com' in call_urlwith code that uses urllib.parse.urlparse (already imported at line 10) to parse call_url and assert that the hostname is exactly github.mycompany.com and not raw.githubusercontent.com. This change keeps the test’s intent (ensuring the enterprise host is used rather than the raw CDN) while eliminating substring-based URL checks. No new imports or helpers are required; we just reuse the existing urlparse import.
Concretely, in the body of test_raw_cdn_not_used_for_enterprise_host around lines 1337–1340, we will:
-
Keep the assignment
call_url = mock_get.call_args[0][0]. -
Replace the two substring assertions with:
parsed_url = urlparse(call_url) assert parsed_url.hostname != 'raw.githubusercontent.com' assert parsed_url.hostname == 'github.mycompany.com'
This does not alter production functionality; it only strengthens the test’s check and removes the pattern triggering CodeQL.
| @@ -1336,8 +1336,9 @@ | ||
|
|
||
| assert result == b'# Agent content' | ||
| call_url = mock_get.call_args[0][0] | ||
| assert 'raw.githubusercontent.com' not in call_url | ||
| assert 'github.mycompany.com' in call_url | ||
| parsed_url = urlparse(call_url) | ||
| assert parsed_url.hostname != 'raw.githubusercontent.com' | ||
| assert parsed_url.hostname == 'github.mycompany.com' | ||
|
|
||
| def test_raw_cdn_fallback_to_api_on_404(self): | ||
| """If raw CDN returns 404, should fall through to the API path.""" |
There was a problem hiding this comment.
Pull request overview
Adds a CDN-first strategy for unauthenticated github.com virtual-file downloads to reduce/avoid GitHub Contents API rate limiting, while keeping authenticated/enterprise/ADO behavior on the API path.
Changes:
- Add
build_raw_content_url()helper forraw.githubusercontent.comURLs. - Add
_try_raw_download()and wire a CDN-first fast-path into_download_github_file()(with main/master fallback + API fallback). - Extend unit tests to cover CDN behavior and improved 403 rate-limit handling; add changelog entries.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/apm_cli/utils/github_host.py |
Introduces raw CDN URL builder used by the downloader. |
src/apm_cli/deps/github_downloader.py |
Implements raw CDN best-effort fetch + rate-limit-aware retry and error messaging. |
tests/unit/test_github_host.py |
Adds unit tests for the raw CDN URL builder. |
tests/test_github_downloader.py |
Adds tests for CDN-first logic and 403 rate-limit handling behavior. |
CHANGELOG.md |
Documents the fix/feature under Unreleased. |
Comments suppressed due to low confidence (3)
src/apm_cli/deps/github_downloader.py:792
- Host matching for github.com is inconsistent: earlier you use
host.lower() == "github.com", but the API URL and fallback URL branches checkif host == "github.com". IfGITHUB_HOST(or parsed host) ever has different casing (e.g., "GitHub.com"), this will incorrectly route to the GHES/api/v3/...endpoint. Please normalize with.lower()consistently for both the main and fallback API URL builders.
# --- Contents API path (authenticated, enterprise, or raw fallback) ---
# Build GitHub API URL - format differs by host type
if host == "github.com":
api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{file_path}?ref={ref}"
elif host.lower().endswith(".ghe.com"):
api_url = f"https://api.{host}/repos/{owner}/{repo}/contents/{file_path}?ref={ref}"
else:
api_url = f"https://{host}/api/v3/repos/{owner}/{repo}/contents/{file_path}?ref={ref}"
# Set up authentication headers
headers = {
'Accept': 'application/vnd.github.v3.raw' # Returns raw content directly
}
if self.github_token:
headers['Authorization'] = f'token {self.github_token}'
# Try to download with the specified ref
try:
response = self._resilient_get(api_url, headers=headers, timeout=30)
response.raise_for_status()
return response.content
except requests.exceptions.HTTPError as e:
if e.response.status_code == 404:
# Try fallback branches if the specified ref fails
if ref not in ["main", "master"]:
# If original ref failed, don't try fallbacks - it might be a specific version
raise RuntimeError(f"File not found: {file_path} at ref '{ref}' in {dep_ref.repo_url}")
# Try the other default branch
fallback_ref = "master" if ref == "main" else "main"
# Build fallback API URL
if host == "github.com":
fallback_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{file_path}?ref={fallback_ref}"
elif host.lower().endswith(".ghe.com"):
fallback_url = f"https://api.{host}/repos/{owner}/{repo}/contents/{file_path}?ref={fallback_ref}"
else:
fallback_url = f"https://{host}/api/v3/repos/{owner}/{repo}/contents/{file_path}?ref={fallback_ref}"
src/apm_cli/deps/github_downloader.py:266
- In _resilient_get(), the rate-limit branch sleeps and
continues even on the final attempt. That can add up to an extra (up to) 60s delay right before returning the last 403/429/503 response with no further retry. Consider only sleeping/continuing whenattempt < max_retries - 1, and otherwise break/return immediately.
This issue also appears on line 288 of the same file.
if is_rate_limited:
last_response = response
retry_after = response.headers.get("Retry-After")
reset_at = response.headers.get("X-RateLimit-Reset")
if retry_after:
try:
wait = min(float(retry_after), 60)
except (TypeError, ValueError):
# Retry-After may be an HTTP-date; fall back to exponential backoff
wait = min(2 ** attempt, 30) * (0.5 + random.random())
elif reset_at:
try:
wait = max(0, min(int(reset_at) - time.time(), 60))
except (TypeError, ValueError):
wait = min(2 ** attempt, 30) * (0.5 + random.random())
else:
wait = min(2 ** attempt, 30) * (0.5 + random.random())
_debug(f"Rate limited ({response.status_code}), retry in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
time.sleep(wait)
continue
src/apm_cli/deps/github_downloader.py:296
- _resilient_get() returns
last_responseif any rate-limited response occurred, even if later attempts failed with ConnectionError/Timeout (becauselast_excis ignored whenlast_responseis set). This can mask real network failures as a rate-limit response. Please ensure that when the final outcome is an exception, that exception is raised (or otherwise preferlast_excover a stalelast_response).
# If rate limiting exhausted all retries, return the last response so
# callers can inspect headers (e.g. X-RateLimit-Remaining) and raise
# an appropriate user-facing error.
if last_response is not None:
return last_response
if last_exc:
raise last_exc
raise requests.exceptions.RequestException(f"All {max_retries} attempts failed for {url}")
| Returns: | ||
| str: ``https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}`` | ||
| """ | ||
| return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}" |
| ### Fixed | ||
|
|
||
| - GitHub API rate-limit 403 responses no longer misdiagnosed as authentication failures — unauthenticated users now see actionable "rate limit exceeded" guidance instead of misleading "private repository" errors | ||
| - Virtual file downloads from public github.com repos no longer require authentication — uses `raw.githubusercontent.com` CDN (no rate limit) before falling back to the Contents API |
… deps Combines CDN fast-path (#322) with git credential fill fallback (#332): - raw.githubusercontent.com CDN first for unauthenticated github.com downloads - git credential fill fallback when no env token is set - GH_TOKEN recognition in token precedence - GitHub API rate-limit 403 detection (not misdiagnosed as auth failure) - Per-host token resolution for enterprise/custom domains - URL-encode refs in CDN URLs for branches with slashes - GIT_ASKPASS=echo hardening for non-interactive credential fill - Improved error messages suggesting gh auth login Closes #319, #331 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Closing — these changes (CDN-first download + rate-limit 403 detection) have been merged into #332 along with the credential-fill fallback, addressing all review comments from both PRs. |
Description
Virtual file downloads (
.agent.md,.prompt.md, collections) use the GitHub Contents API, which caps unauthenticated requests at 60/hour shared per IP. Full repo and subdirectory downloads already use git protocol and are unaffected.This adds a CDN fast-path: for
github.comwithout a token,_download_github_filenow triesraw.githubusercontent.comfirst — no rate limit, no auth required, single HTTP GET from CDN. Falls back to the Contents API on failure (private repos, enterprise hosts, network errors).github_host.py: Addbuild_raw_content_url()helpergithub_downloader.py: Add_try_raw_download()— best-effort CDN fetch returningNoneon failure. Modify_download_github_file()to use CDN-first with main/master branch fallback when unauthenticated on github.comType of change
Testing
12 new unit tests: CDN success path, token bypass, enterprise bypass, 404→API fallback, main/master branch fallback, network error fallback, specific ref handling,
_try_raw_downloadcontract,build_raw_content_urlbuilder.