Apply tile byte cap to local files + SSRF hardening on HTTP geotiff reads (#1664)#1668
Merged
Conversation
…HTTP source (#1664) Local-file reads now check ``TileByteCounts`` and ``StripByteCounts`` against the same ``XRSPATIAL_COG_MAX_TILE_BYTES`` cap that HTTP uses. A small compressed slice can still expand into gigabytes under deflate/zstd/lzw, so mmap-bounded slicing alone is not enough. ``_HTTPSource`` now validates URLs at construction: - Scheme allow-list (http, https). Widen via ``XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES``. - Hostnames resolved through ``socket.getaddrinfo``; any loopback, link-local, RFC1918 private, multicast, or reserved IP is rejected. Override via ``XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1``. Rejects on any resolved IP being unsafe so DNS-rebind tricks fail closed. - Explicit connect (10 s) and read (30 s) timeouts. Override via ``XRSPATIAL_GEOTIFF_HTTP_CONNECT_TIMEOUT`` and ``XRSPATIAL_GEOTIFF_HTTP_READ_TIMEOUT``. - Redirect cap of 5 set on the urllib3 Retry config. New ``UnsafeURLError`` (a ``ValueError`` subclass) is re-exported on ``xrspatial.geotiff``. Existing handlers that catch ``ValueError`` keep working. The local-server test in ``test_cog_http_concurrent.py`` and the old ``test_local_path_unaffected_by_cap`` test are updated to reflect the new behavior; the former sets ``XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1`` for its 127.0.0.1 loop, the latter no longer pins the broken cap-bypass.
- ``test_local_tile_byte_cap_1664.py``: tiled and stripped TIFFs with fabricated huge ``TileByteCounts`` / ``StripByteCounts`` raise the cap. Env override and default cap headroom both exercised. - ``test_ssrf_hardening_1664.py``: scheme allow-list, IPv4 and IPv6 private/loopback/link-local rejection, env-var escape hatch, DNS-rebind partial-private rejection, timeout env-var parsing, and ``_HTTPSource.__init__`` integration. ``socket.getaddrinfo`` is monkeypatched per-test so no real DNS or HTTP is involved. - ``docs/source/reference/geotiff.rst``: new section documenting the cap, the SSRF defenses, every env-var override, and the ``UnsafeURLError`` exception type. Notes the ``XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS`` escape hatch for local test servers.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens the GeoTIFF reader by extending the existing per-tile compressed byte cap to local-file reads and adding SSRF defenses for HTTP-based reads via _HTTPSource.
Changes:
- Apply
XRSPATIAL_COG_MAX_TILE_BYTESenforcement to local_read_tilesand_read_stripsto reduce decompression/OOM DoS risk. - Add URL validation, private-host blocking, and configurable timeouts for
_HTTPSource, plus exportUnsafeURLErroras part of the publicxrspatial.geotiffAPI. - Add documentation and extensive new tests for the local cap behavior and SSRF protections; update existing HTTP-concurrency tests to use the private-host escape hatch.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
xrspatial/geotiff/_reader.py |
Adds SSRF URL validation + timeout handling for _HTTPSource; applies tile/strip byte caps to local reads. |
xrspatial/geotiff/__init__.py |
Re-exports UnsafeURLError via the public geotiff API. |
docs/source/reference/geotiff.rst |
Documents the new security/I/O limits and environment variable controls. |
xrspatial/geotiff/tests/test_ssrf_hardening_1664.py |
New unit tests for SSRF validation logic and _HTTPSource integration. |
xrspatial/geotiff/tests/test_local_tile_byte_cap_1664.py |
New tests ensuring local tile/strip byte caps are enforced and configurable. |
xrspatial/geotiff/tests/test_security.py |
Updates an existing test to reflect that the cap now applies to local reads. |
xrspatial/geotiff/tests/test_features.py |
Ensures UnsafeURLError is tracked as part of the supported public API surface. |
xrspatial/geotiff/tests/test_cog_http_concurrent.py |
Updates fixture to set XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1 for loopback HTTP tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
599
to
631
| def __init__(self, url: str): | ||
| # SSRF defense (issue #1664): validate scheme / host *before* | ||
| # any network call. UnsafeURLError subclasses ValueError so | ||
| # callers that already catch ValueError keep working. The check | ||
| # is best-effort -- DNS results can change between validate | ||
| # time and connect time, but rejecting at construction blocks | ||
| # the vast majority of static SSRF payloads. | ||
| _validate_http_url(url) | ||
| self._url = url | ||
| self._size = None | ||
| self._pool = _get_http_pool() | ||
| self._connect_timeout = _http_connect_timeout() | ||
| self._read_timeout = _http_read_timeout() | ||
|
|
||
| def _urllib3_timeout(self): | ||
| """Build a urllib3 Timeout object lazily. | ||
|
|
||
| Imported here so that the module-level import of urllib3 stays | ||
| optional (we fall back to stdlib if urllib3 is missing). | ||
| """ | ||
| import urllib3 | ||
| return urllib3.Timeout( | ||
| connect=self._connect_timeout, read=self._read_timeout) | ||
|
|
||
| def read_range(self, start: int, length: int) -> bytes: | ||
| end = start + length - 1 | ||
| if self._pool is not None: | ||
| resp = self._pool.request( | ||
| 'GET', self._url, | ||
| headers={'Range': f'bytes={start}-{end}'}, | ||
| timeout=self._urllib3_timeout(), | ||
| redirect=_HTTP_MAX_REDIRECTS, | ||
| ) |
Comment on lines
623
to
641
| def read_range(self, start: int, length: int) -> bytes: | ||
| end = start + length - 1 | ||
| if self._pool is not None: | ||
| resp = self._pool.request( | ||
| 'GET', self._url, | ||
| headers={'Range': f'bytes={start}-{end}'}, | ||
| timeout=self._urllib3_timeout(), | ||
| redirect=_HTTP_MAX_REDIRECTS, | ||
| ) | ||
| return resp.data | ||
| # Fallback: stdlib | ||
| # Fallback: stdlib. urlopen's ``timeout`` is a single value, so | ||
| # use the more conservative read timeout; the connect timeout | ||
| # isn't separately controllable on stdlib urllib. | ||
| req = urllib.request.Request( | ||
| self._url, | ||
| headers={'Range': f'bytes={start}-{end}'}, | ||
| ) | ||
| with urllib.request.urlopen(req) as resp: | ||
| with urllib.request.urlopen(req, timeout=self._read_timeout) as resp: | ||
| return resp.read() |
Comment on lines
+331
to
+343
| """Return the set of URL schemes ``_HTTPSource`` accepts. | ||
|
|
||
| ``XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES`` (comma-separated) overrides the | ||
| default. The env var only widens the allow-list; ``http`` and ``https`` | ||
| stay implicit. | ||
| """ | ||
| raw = _os_module.environ.get('XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES') | ||
| if not raw: | ||
| return _HTTP_ALLOWED_SCHEMES_DEFAULT | ||
| extras = tuple( | ||
| s.strip().lower() for s in raw.split(',') if s.strip() | ||
| ) | ||
| return _HTTP_ALLOWED_SCHEMES_DEFAULT + extras |
| any tile or strip whose declared size exceeds the cap. | ||
|
|
||
| * Default: 256 MiB | ||
| * Override: ``XRSPATIAL_COG_MAX_TILE_BYTES`` (positive integer, bytes) |
Comment on lines
+272
to
+279
| """The top-level dispatcher refuses file:// URLs via _HTTPSource.""" | ||
| # ``file://`` does not match the http(s) prefix in _open_source, so | ||
| # it does NOT hit _HTTPSource at all -- it gets routed via the path | ||
| # branch which interprets the URL literally and fails to find the | ||
| # file. The relevant guarantee is just: arbitrary local file access | ||
| # via ``file://`` URL does not succeed quietly. | ||
| from xrspatial.geotiff._reader import read_to_array | ||
| with pytest.raises((ValueError, FileNotFoundError, OSError)): |
Comment on lines
+51
to
+53
| * Scheme allow-list: ``http`` and ``https`` only. Widen via | ||
| ``XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES`` (comma-separated list, e.g. | ||
| ``"ftp,gopher"``). |
Merge origin/main into the SSRF-hardening branch and address the Copilot review comments. Redirect handling (review #1, #2): - urllib3 path: stop delegating redirect-follow to urllib3. _HTTPSource now GETs with redirect=False and walks the 3xx chain in Python, re-validating each Location against _validate_http_url. Caps the chain at _HTTP_MAX_REDIRECTS (5). A public URL can no longer 3xx into a loopback or private IP. - stdlib path: install _ValidatingRedirectHandler on a module-level opener so the stdlib fallback gets the same per-hop re-validation and redirect cap. read_range / read_all both go through it now. Scheme allow-list (review #3): - Drop XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES. _HTTPSource is a urllib3 / urllib Range-GET implementation; widening the validator without widening the source just moved the failure to connect time. fsspec handles every other scheme:// via _open_source's _CloudSource branch. Env-var clamp (review #4): - _max_tile_bytes_from_env now falls back to the default for zero or negative values, matching the timeout helpers. The previous max(1, val) clamp silently rejected every tile on a typoed XRSPATIAL_COG_MAX_TILE_BYTES=-1. Docs and comments (review #5, #6): - "Security and I/O limits" docs: explain that non-http(s) schemes dispatch via fsspec and that the redirect cap re-validates each hop. Drop the misleading XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES example. - test_read_to_array_rejects_file_url: comment corrected to match _open_source routing file:// through fsspec, and the expected- exception list now includes ImportError. Merge conflict resolution against origin/main: - __init__.py and test_features.py: keep both GeoTIFFFallbackWarning (#1662) and UnsafeURLError in __all__ / TestPublicAPI.expected. - _reader.py _read_strips: byte-cap (this PR) and RowsPerStrip<=0 guard (#1666) both kept; cap runs first. - docs/geotiff.rst: "Security and I/O limits" and "Strict mode" live as sibling sections. xrspatial/geotiff/tests/: 1611 passed (3 pre-existing matplotlib palette failures unchanged).
This was referenced May 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1664.
Summary
Two hardening fixes in
xrspatial/geotiff/_reader.py:XRSPATIAL_COG_MAX_TILE_BYTEScap (introduced for HTTP COGs in Cap per-tile compressed byte_count in HTTP COG reader (DoS) #1536) now also applies to local-file_read_tilesand_read_strips. mmap slicing is bounded by file size, but a small compressed slice can still balloon into gigabytes under deflate / zstd / lzw, so the decompressor was the OOM trigger before this change._HTTPSourcenow validates URLs at construction. Scheme allow-list, host filtering against loopback / link-local / RFC1918 / multicast / reserved IPs (resolved throughsocket.getaddrinfo), explicit connect (10 s) and read (30 s) timeouts, and a redirect cap of 5.New
UnsafeURLError(aValueErrorsubclass) is exported onxrspatial.geotiffso callers can catch SSRF rejections specifically.Backward compatibility
This is the breaking part:
open_geotiff('http://127.0.0.1:...')(and any other loopback / private-IP URL) is rejected by default starting with this PR. This is common in tests and local dev. Two options:XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1for the duration of the call or the test run. The flag is documented indocs/source/reference/geotiff.rst.monkeypatch.setenv('XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS', '1'). The existing local-server test intest_cog_http_concurrent.pyshows the pattern.The existing
test_local_path_unaffected_by_captest was pinning the now-broken behavior in place (cap=1 on local read must succeed). That test has been replaced withtest_local_path_respects_default_cap.Env vars
XRSPATIAL_COG_MAX_TILE_BYTESXRSPATIAL_GEOTIFF_ALLOWED_SCHEMESXRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS01to allow loopback / private IPsXRSPATIAL_GEOTIFF_HTTP_CONNECT_TIMEOUT10XRSPATIAL_GEOTIFF_HTTP_READ_TIMEOUT30Test plan
xrspatial/geotiff/tests/test_local_tile_byte_cap_1664.py— 10 tests covering tiled + stripped TIFFs with patched byte counts, env override behavior, default cap headroom.xrspatial/geotiff/tests/test_ssrf_hardening_1664.py— 37 tests covering scheme allow-list (http, https, file, gopher, ftp, env-widened), IPv4 and IPv6 private host rejection, escape hatch, DNS-rebind partial-private rejection, timeout env parsing,_HTTPSource.__init__integration.test_security.py::TestHTTPTileByteCountCapstill passes (46/46).test_cog_http_concurrent.pystill passes after fixture update (6/6).test_http_cog_coalesce.pystill passes (11/11).xrspatial/geotiff/tests/suite: 1517 pass, 7 skipped, 3 unrelated matplotlib-recursion failures that pre-date this PR.Files
xrspatial/geotiff/_reader.py— new helpers_validate_http_url,_http_allowed_schemes,_http_allow_private_hosts,_http_timeout_from_env,_ip_is_private,UnsafeURLError; cap loop in_read_tilesand_read_strips; timeout + redirect wiring in_HTTPSource.xrspatial/geotiff/__init__.py— re-exportUnsafeURLError.docs/source/reference/geotiff.rst— new "Security and I/O limits" section.