Skip to content

feat: CDN-first download for unauthenticated github.com virtual files#322

Closed
Copilot wants to merge 4 commits intomainfrom
copilot/feature-support-individual-file-paths
Closed

feat: CDN-first download for unauthenticated github.com virtual files#322
Copilot wants to merge 4 commits intomainfrom
copilot/feature-support-individual-file-paths

Conversation

Copy link
Contributor

Copilot AI commented Mar 16, 2026

Description

Virtual file downloads (.agent.md, .prompt.md, collections) use the GitHub Contents API, which caps unauthenticated requests at 60/hour shared per IP. Full repo and subdirectory downloads already use git protocol and are unaffected.

This adds a CDN fast-path: for github.com without a token, _download_github_file now tries raw.githubusercontent.com first — no rate limit, no auth required, single HTTP GET from CDN. Falls back to the Contents API on failure (private repos, enterprise hosts, network errors).

  • github_host.py: Add build_raw_content_url() helper
  • github_downloader.py: Add _try_raw_download() — best-effort CDN fetch returning None on failure. Modify _download_github_file() to use CDN-first with main/master branch fallback when unauthenticated on github.com
  • No behavior change for authenticated requests (API, 5k req/h), GHES, GHE-DR, or Azure DevOps
Scenario Path
github.com, no token CDN → API fallback
github.com, with token API directly
Enterprise / ADO API directly

Type of change

  • Bug fix
  • New feature
  • Documentation
  • Maintenance / refactor

Testing

  • Tested locally
  • All existing tests pass
  • Added tests for new functionality (if applicable)

12 new unit tests: CDN success path, token bypass, enterprise bypass, 404→API fallback, main/master branch fallback, network error fallback, specific ref handling, _try_raw_download contract, build_raw_content_url builder.

Copilot AI and others added 4 commits March 16, 2026 09:00
…h failure

GitHub returns 403 with X-RateLimit-Remaining: 0 when the primary rate
limit is exhausted. Previously this was treated as an authentication
failure, showing a misleading "private repository" error. Now:

- _resilient_get retries 403 responses that carry rate-limit exhaustion
  headers (X-RateLimit-Remaining: 0), using X-RateLimit-Reset for wait
- _download_github_file detects rate limiting before falling through to
  the auth-failure error path, surfacing actionable guidance
- New unit tests cover all rate-limit detection paths

Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com>
…ercontent.com

Use raw.githubusercontent.com (CDN, no rate limit) for unauthenticated
github.com file downloads before falling back to the Contents API.

- Add build_raw_content_url() helper to github_host.py
- Add _try_raw_download() method for best-effort CDN fetch
- Modify _download_github_file() to try CDN first when no token is set
- CDN path handles main/master branch fallback
- Authenticated requests and enterprise hosts still use Contents API directly
- Add 10 new tests for CDN download path + 2 tests for URL builder

Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com>
…e review

Co-authored-by: danielmeppiel <51440732+danielmeppiel@users.noreply.github.com>
assert result == b'# Agent content'
# Should have hit raw.githubusercontent.com, not API
call_url = mock_get.call_args[0][0]
assert 'raw.githubusercontent.com' in call_url

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High test

The string
raw.githubusercontent.com
may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI about 17 hours ago

In general, to avoid incomplete URL substring sanitization, the code should parse the URL and inspect its components (typically the hostname) instead of checking for a substring within the entire URL string. This avoids false matches where the trusted host appears in the path, query, or as part of another hostname.

For this specific test in tests/test_github_downloader.py, we should stop using 'raw.githubusercontent.com' in call_url and 'api.github.com' not in call_url, and instead parse call_url with urllib.parse.urlparse, then assert on parsed.hostname. This maintains the existing functional intent—verifying which host was used—while avoiding substring matching on the full URL string. The file already imports urlparse on line 10, so no new imports are needed.

Concretely:

  • In TestRawContentCDNDownload.test_raw_cdn_used_for_github_com_without_token, replace:
    call_url = mock_get.call_args[0][0]
    assert 'raw.githubusercontent.com' in call_url
    assert 'api.github.com' not in call_url
    with:
    call_url = mock_get.call_args[0][0]
    parsed = urlparse(call_url)
    assert parsed.hostname == 'raw.githubusercontent.com'
  • In TestRawContentCDNDownload.test_raw_cdn_not_used_when_token_present, replace:
    call_url = mock_get.call_args[0][0]
    assert 'api.github.com' in call_url
    with:
    call_url = mock_get.call_args[0][0]
    parsed = urlparse(call_url)
    assert parsed.hostname == 'api.github.com'

No other changes or new helpers are required.

Suggested changeset 1
tests/test_github_downloader.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/test_github_downloader.py b/tests/test_github_downloader.py
--- a/tests/test_github_downloader.py
+++ b/tests/test_github_downloader.py
@@ -1291,8 +1291,8 @@
             assert result == b'# Agent content'
             # Should have hit raw.githubusercontent.com, not API
             call_url = mock_get.call_args[0][0]
-            assert 'raw.githubusercontent.com' in call_url
-            assert 'api.github.com' not in call_url
+            parsed = urlparse(call_url)
+            assert parsed.hostname == 'raw.githubusercontent.com'
 
     def test_raw_cdn_not_used_when_token_present(self):
         """Authenticated requests should go straight to Contents API."""
@@ -1314,7 +1314,8 @@
             assert result == b'# Agent content'
             # Should use API with auth, not raw CDN
             call_url = mock_get.call_args[0][0]
-            assert 'api.github.com' in call_url
+            parsed = urlparse(call_url)
+            assert parsed.hostname == 'api.github.com'
 
     def test_raw_cdn_not_used_for_enterprise_host(self):
         """Enterprise hosts should use API directly (no raw.githubusercontent.com)."""
EOF
@@ -1291,8 +1291,8 @@
assert result == b'# Agent content'
# Should have hit raw.githubusercontent.com, not API
call_url = mock_get.call_args[0][0]
assert 'raw.githubusercontent.com' in call_url
assert 'api.github.com' not in call_url
parsed = urlparse(call_url)
assert parsed.hostname == 'raw.githubusercontent.com'

def test_raw_cdn_not_used_when_token_present(self):
"""Authenticated requests should go straight to Contents API."""
@@ -1314,7 +1314,8 @@
assert result == b'# Agent content'
# Should use API with auth, not raw CDN
call_url = mock_get.call_args[0][0]
assert 'api.github.com' in call_url
parsed = urlparse(call_url)
assert parsed.hostname == 'api.github.com'

def test_raw_cdn_not_used_for_enterprise_host(self):
"""Enterprise hosts should use API directly (no raw.githubusercontent.com)."""
Copilot is powered by AI and may make mistakes. Always verify output.
assert result == b'# Agent content'
# Should use API with auth, not raw CDN
call_url = mock_get.call_args[0][0]
assert 'api.github.com' in call_url

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High test

The string
api.github.com
may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI about 17 hours ago

In general, to avoid incomplete URL substring sanitization, the code should parse the URL and compare the parsed hostname (and optionally scheme and path) instead of checking for string containment. For this specific test, we can keep the semantics—verifying that the request targets the GitHub API—while using urllib.parse.urlparse to assert that the hostname is api.github.com, rather than checking that 'api.github.com' appears somewhere in the full URL string.

Concretely, in tests/test_github_downloader.py, within TestGitHubPackageDownloaderIntegration.test_raw_cdn_not_used_when_token_present, replace:

call_url = mock_get.call_args[0][0]
assert 'api.github.com' in call_url

with code that parses the URL and inspects its hostname:

call_url = mock_get.call_args[0][0]
parsed_url = urlparse(call_url)
assert parsed_url.hostname == 'api.github.com'

The file already imports urlparse at the top (from urllib.parse import urlparse), so no new imports are required. This change does not alter the intended behavior of the test; it simply makes the assertion more precise and immune to the substring issue.

Suggested changeset 1
tests/test_github_downloader.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/test_github_downloader.py b/tests/test_github_downloader.py
--- a/tests/test_github_downloader.py
+++ b/tests/test_github_downloader.py
@@ -1314,7 +1314,8 @@
             assert result == b'# Agent content'
             # Should use API with auth, not raw CDN
             call_url = mock_get.call_args[0][0]
-            assert 'api.github.com' in call_url
+            parsed_url = urlparse(call_url)
+            assert parsed_url.hostname == 'api.github.com'
 
     def test_raw_cdn_not_used_for_enterprise_host(self):
         """Enterprise hosts should use API directly (no raw.githubusercontent.com)."""
EOF
@@ -1314,7 +1314,8 @@
assert result == b'# Agent content'
# Should use API with auth, not raw CDN
call_url = mock_get.call_args[0][0]
assert 'api.github.com' in call_url
parsed_url = urlparse(call_url)
assert parsed_url.hostname == 'api.github.com'

def test_raw_cdn_not_used_for_enterprise_host(self):
"""Enterprise hosts should use API directly (no raw.githubusercontent.com)."""
Copilot is powered by AI and may make mistakes. Always verify output.
assert result == b'# Agent content'
call_url = mock_get.call_args[0][0]
assert 'raw.githubusercontent.com' not in call_url
assert 'github.mycompany.com' in call_url

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High test

The string
github.mycompany.com
may be at an arbitrary position in the sanitized URL.

Copilot Autofix

AI about 17 hours ago

In general, the fix is to avoid checking for hostnames using substring operations on the entire URL string, and instead parse the URL and inspect its hostname (or related components) explicitly. For tests, this means parsing the URL the code under test generated, and asserting on parsed.hostname rather than using "host" in url_string.

For this specific test in tests/test_github_downloader.py, we should replace:

call_url = mock_get.call_args[0][0]
assert 'raw.githubusercontent.com' not in call_url
assert 'github.mycompany.com' in call_url

with code that uses urllib.parse.urlparse (already imported at line 10) to parse call_url and assert that the hostname is exactly github.mycompany.com and not raw.githubusercontent.com. This change keeps the test’s intent (ensuring the enterprise host is used rather than the raw CDN) while eliminating substring-based URL checks. No new imports or helpers are required; we just reuse the existing urlparse import.

Concretely, in the body of test_raw_cdn_not_used_for_enterprise_host around lines 1337–1340, we will:

  • Keep the assignment call_url = mock_get.call_args[0][0].

  • Replace the two substring assertions with:

    parsed_url = urlparse(call_url)
    assert parsed_url.hostname != 'raw.githubusercontent.com'
    assert parsed_url.hostname == 'github.mycompany.com'

This does not alter production functionality; it only strengthens the test’s check and removes the pattern triggering CodeQL.

Suggested changeset 1
tests/test_github_downloader.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tests/test_github_downloader.py b/tests/test_github_downloader.py
--- a/tests/test_github_downloader.py
+++ b/tests/test_github_downloader.py
@@ -1336,8 +1336,9 @@
 
             assert result == b'# Agent content'
             call_url = mock_get.call_args[0][0]
-            assert 'raw.githubusercontent.com' not in call_url
-            assert 'github.mycompany.com' in call_url
+            parsed_url = urlparse(call_url)
+            assert parsed_url.hostname != 'raw.githubusercontent.com'
+            assert parsed_url.hostname == 'github.mycompany.com'
 
     def test_raw_cdn_fallback_to_api_on_404(self):
         """If raw CDN returns 404, should fall through to the API path."""
EOF
@@ -1336,8 +1336,9 @@

assert result == b'# Agent content'
call_url = mock_get.call_args[0][0]
assert 'raw.githubusercontent.com' not in call_url
assert 'github.mycompany.com' in call_url
parsed_url = urlparse(call_url)
assert parsed_url.hostname != 'raw.githubusercontent.com'
assert parsed_url.hostname == 'github.mycompany.com'

def test_raw_cdn_fallback_to_api_on_404(self):
"""If raw CDN returns 404, should fall through to the API path."""
Copilot is powered by AI and may make mistakes. Always verify output.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a CDN-first strategy for unauthenticated github.com virtual-file downloads to reduce/avoid GitHub Contents API rate limiting, while keeping authenticated/enterprise/ADO behavior on the API path.

Changes:

  • Add build_raw_content_url() helper for raw.githubusercontent.com URLs.
  • Add _try_raw_download() and wire a CDN-first fast-path into _download_github_file() (with main/master fallback + API fallback).
  • Extend unit tests to cover CDN behavior and improved 403 rate-limit handling; add changelog entries.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/apm_cli/utils/github_host.py Introduces raw CDN URL builder used by the downloader.
src/apm_cli/deps/github_downloader.py Implements raw CDN best-effort fetch + rate-limit-aware retry and error messaging.
tests/unit/test_github_host.py Adds unit tests for the raw CDN URL builder.
tests/test_github_downloader.py Adds tests for CDN-first logic and 403 rate-limit handling behavior.
CHANGELOG.md Documents the fix/feature under Unreleased.
Comments suppressed due to low confidence (3)

src/apm_cli/deps/github_downloader.py:792

  • Host matching for github.com is inconsistent: earlier you use host.lower() == "github.com", but the API URL and fallback URL branches check if host == "github.com". If GITHUB_HOST (or parsed host) ever has different casing (e.g., "GitHub.com"), this will incorrectly route to the GHES /api/v3/... endpoint. Please normalize with .lower() consistently for both the main and fallback API URL builders.
        # --- Contents API path (authenticated, enterprise, or raw fallback) ---
        # Build GitHub API URL - format differs by host type
        if host == "github.com":
            api_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{file_path}?ref={ref}"
        elif host.lower().endswith(".ghe.com"):
            api_url = f"https://api.{host}/repos/{owner}/{repo}/contents/{file_path}?ref={ref}"
        else:
            api_url = f"https://{host}/api/v3/repos/{owner}/{repo}/contents/{file_path}?ref={ref}"
        
        # Set up authentication headers
        headers = {
            'Accept': 'application/vnd.github.v3.raw'  # Returns raw content directly
        }
        if self.github_token:
            headers['Authorization'] = f'token {self.github_token}'
        
        # Try to download with the specified ref
        try:
            response = self._resilient_get(api_url, headers=headers, timeout=30)
            response.raise_for_status()
            return response.content
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 404:
                # Try fallback branches if the specified ref fails
                if ref not in ["main", "master"]:
                    # If original ref failed, don't try fallbacks - it might be a specific version
                    raise RuntimeError(f"File not found: {file_path} at ref '{ref}' in {dep_ref.repo_url}")
                
                # Try the other default branch
                fallback_ref = "master" if ref == "main" else "main"
                
                # Build fallback API URL
                if host == "github.com":
                    fallback_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{file_path}?ref={fallback_ref}"
                elif host.lower().endswith(".ghe.com"):
                    fallback_url = f"https://api.{host}/repos/{owner}/{repo}/contents/{file_path}?ref={fallback_ref}"
                else:
                    fallback_url = f"https://{host}/api/v3/repos/{owner}/{repo}/contents/{file_path}?ref={fallback_ref}"

src/apm_cli/deps/github_downloader.py:266

  • In _resilient_get(), the rate-limit branch sleeps and continues even on the final attempt. That can add up to an extra (up to) 60s delay right before returning the last 403/429/503 response with no further retry. Consider only sleeping/continuing when attempt < max_retries - 1, and otherwise break/return immediately.

This issue also appears on line 288 of the same file.

                if is_rate_limited:
                    last_response = response
                    retry_after = response.headers.get("Retry-After")
                    reset_at = response.headers.get("X-RateLimit-Reset")
                    if retry_after:
                        try:
                            wait = min(float(retry_after), 60)
                        except (TypeError, ValueError):
                            # Retry-After may be an HTTP-date; fall back to exponential backoff
                            wait = min(2 ** attempt, 30) * (0.5 + random.random())
                    elif reset_at:
                        try:
                            wait = max(0, min(int(reset_at) - time.time(), 60))
                        except (TypeError, ValueError):
                            wait = min(2 ** attempt, 30) * (0.5 + random.random())
                    else:
                        wait = min(2 ** attempt, 30) * (0.5 + random.random())
                    _debug(f"Rate limited ({response.status_code}), retry in {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
                    time.sleep(wait)
                    continue

src/apm_cli/deps/github_downloader.py:296

  • _resilient_get() returns last_response if any rate-limited response occurred, even if later attempts failed with ConnectionError/Timeout (because last_exc is ignored when last_response is set). This can mask real network failures as a rate-limit response. Please ensure that when the final outcome is an exception, that exception is raised (or otherwise prefer last_exc over a stale last_response).
        # If rate limiting exhausted all retries, return the last response so
        # callers can inspect headers (e.g. X-RateLimit-Remaining) and raise
        # an appropriate user-facing error.
        if last_response is not None:
            return last_response

        if last_exc:
            raise last_exc
        raise requests.exceptions.RequestException(f"All {max_retries} attempts failed for {url}")

Returns:
str: ``https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}``
"""
return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}"
Comment on lines +20 to +23
### Fixed

- GitHub API rate-limit 403 responses no longer misdiagnosed as authentication failures — unauthenticated users now see actionable "rate limit exceeded" guidance instead of misleading "private repository" errors
- Virtual file downloads from public github.com repos no longer require authentication — uses `raw.githubusercontent.com` CDN (no rate limit) before falling back to the Contents API
danielmeppiel added a commit that referenced this pull request Mar 16, 2026
… deps

Combines CDN fast-path (#322) with git credential fill fallback (#332):

- raw.githubusercontent.com CDN first for unauthenticated github.com downloads
- git credential fill fallback when no env token is set
- GH_TOKEN recognition in token precedence
- GitHub API rate-limit 403 detection (not misdiagnosed as auth failure)
- Per-host token resolution for enterprise/custom domains
- URL-encode refs in CDN URLs for branches with slashes
- GIT_ASKPASS=echo hardening for non-interactive credential fill
- Improved error messages suggesting gh auth login

Closes #319, #331

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel
Copy link
Collaborator

Closing — these changes (CDN-first download + rate-limit 403 detection) have been merged into #332 along with the credential-fill fallback, addressing all review comments from both PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add CDN retrieval for individual file paths in APM dependencies

3 participants