Skip to content

Apply tile byte cap to local files + SSRF hardening on HTTP geotiff reads (#1664)#1668

Merged
brendancol merged 3 commits into
mainfrom
issue-1664
May 12, 2026
Merged

Apply tile byte cap to local files + SSRF hardening on HTTP geotiff reads (#1664)#1668
brendancol merged 3 commits into
mainfrom
issue-1664

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #1664.

Summary

Two hardening fixes in xrspatial/geotiff/_reader.py:

  1. The XRSPATIAL_COG_MAX_TILE_BYTES cap (introduced for HTTP COGs in Cap per-tile compressed byte_count in HTTP COG reader (DoS) #1536) now also applies to local-file _read_tiles and _read_strips. mmap slicing is bounded by file size, but a small compressed slice can still balloon into gigabytes under deflate / zstd / lzw, so the decompressor was the OOM trigger before this change.
  2. _HTTPSource now validates URLs at construction. Scheme allow-list, host filtering against loopback / link-local / RFC1918 / multicast / reserved IPs (resolved through socket.getaddrinfo), explicit connect (10 s) and read (30 s) timeouts, and a redirect cap of 5.

New UnsafeURLError (a ValueError subclass) is exported on xrspatial.geotiff so callers can catch SSRF rejections specifically.

Backward compatibility

This is the breaking part: open_geotiff('http://127.0.0.1:...') (and any other loopback / private-IP URL) is rejected by default starting with this PR. This is common in tests and local dev. Two options:

  • Set XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1 for the duration of the call or the test run. The flag is documented in docs/source/reference/geotiff.rst.
  • For pytest fixtures, use monkeypatch.setenv('XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS', '1'). The existing local-server test in test_cog_http_concurrent.py shows the pattern.

The existing test_local_path_unaffected_by_cap test was pinning the now-broken behavior in place (cap=1 on local read must succeed). That test has been replaced with test_local_path_respects_default_cap.

Env vars

Var Default Effect
XRSPATIAL_COG_MAX_TILE_BYTES 256 MiB Per-tile / per-strip compressed-byte cap (HTTP + local)
XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES (http, https) Comma-separated extra schemes
XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS 0 Set to 1 to allow loopback / private IPs
XRSPATIAL_GEOTIFF_HTTP_CONNECT_TIMEOUT 10 Seconds
XRSPATIAL_GEOTIFF_HTTP_READ_TIMEOUT 30 Seconds

Test plan

  • xrspatial/geotiff/tests/test_local_tile_byte_cap_1664.py — 10 tests covering tiled + stripped TIFFs with patched byte counts, env override behavior, default cap headroom.
  • xrspatial/geotiff/tests/test_ssrf_hardening_1664.py — 37 tests covering scheme allow-list (http, https, file, gopher, ftp, env-widened), IPv4 and IPv6 private host rejection, escape hatch, DNS-rebind partial-private rejection, timeout env parsing, _HTTPSource.__init__ integration.
  • Existing test_security.py::TestHTTPTileByteCountCap still passes (46/46).
  • Existing test_cog_http_concurrent.py still passes after fixture update (6/6).
  • Existing test_http_cog_coalesce.py still passes (11/11).
  • Full xrspatial/geotiff/tests/ suite: 1517 pass, 7 skipped, 3 unrelated matplotlib-recursion failures that pre-date this PR.

Files

  • xrspatial/geotiff/_reader.py — new helpers _validate_http_url, _http_allowed_schemes, _http_allow_private_hosts, _http_timeout_from_env, _ip_is_private, UnsafeURLError; cap loop in _read_tiles and _read_strips; timeout + redirect wiring in _HTTPSource.
  • xrspatial/geotiff/__init__.py — re-export UnsafeURLError.
  • docs/source/reference/geotiff.rst — new "Security and I/O limits" section.

…HTTP source (#1664)

Local-file reads now check ``TileByteCounts`` and ``StripByteCounts``
against the same ``XRSPATIAL_COG_MAX_TILE_BYTES`` cap that HTTP uses.
A small compressed slice can still expand into gigabytes under
deflate/zstd/lzw, so mmap-bounded slicing alone is not enough.

``_HTTPSource`` now validates URLs at construction:

- Scheme allow-list (http, https). Widen via
  ``XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES``.
- Hostnames resolved through ``socket.getaddrinfo``; any loopback,
  link-local, RFC1918 private, multicast, or reserved IP is rejected.
  Override via ``XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1``. Rejects
  on any resolved IP being unsafe so DNS-rebind tricks fail closed.
- Explicit connect (10 s) and read (30 s) timeouts. Override via
  ``XRSPATIAL_GEOTIFF_HTTP_CONNECT_TIMEOUT`` and
  ``XRSPATIAL_GEOTIFF_HTTP_READ_TIMEOUT``.
- Redirect cap of 5 set on the urllib3 Retry config.

New ``UnsafeURLError`` (a ``ValueError`` subclass) is re-exported on
``xrspatial.geotiff``. Existing handlers that catch ``ValueError`` keep
working.

The local-server test in ``test_cog_http_concurrent.py`` and the old
``test_local_path_unaffected_by_cap`` test are updated to reflect the
new behavior; the former sets ``XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1``
for its 127.0.0.1 loop, the latter no longer pins the broken
cap-bypass.
- ``test_local_tile_byte_cap_1664.py``: tiled and stripped TIFFs with
  fabricated huge ``TileByteCounts`` / ``StripByteCounts`` raise the
  cap. Env override and default cap headroom both exercised.
- ``test_ssrf_hardening_1664.py``: scheme allow-list, IPv4 and IPv6
  private/loopback/link-local rejection, env-var escape hatch,
  DNS-rebind partial-private rejection, timeout env-var parsing,
  and ``_HTTPSource.__init__`` integration. ``socket.getaddrinfo``
  is monkeypatched per-test so no real DNS or HTTP is involved.
- ``docs/source/reference/geotiff.rst``: new section documenting
  the cap, the SSRF defenses, every env-var override, and the
  ``UnsafeURLError`` exception type. Notes the
  ``XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS`` escape hatch for local
  test servers.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 13:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the GeoTIFF reader by extending the existing per-tile compressed byte cap to local-file reads and adding SSRF defenses for HTTP-based reads via _HTTPSource.

Changes:

  • Apply XRSPATIAL_COG_MAX_TILE_BYTES enforcement to local _read_tiles and _read_strips to reduce decompression/OOM DoS risk.
  • Add URL validation, private-host blocking, and configurable timeouts for _HTTPSource, plus export UnsafeURLError as part of the public xrspatial.geotiff API.
  • Add documentation and extensive new tests for the local cap behavior and SSRF protections; update existing HTTP-concurrency tests to use the private-host escape hatch.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
xrspatial/geotiff/_reader.py Adds SSRF URL validation + timeout handling for _HTTPSource; applies tile/strip byte caps to local reads.
xrspatial/geotiff/__init__.py Re-exports UnsafeURLError via the public geotiff API.
docs/source/reference/geotiff.rst Documents the new security/I/O limits and environment variable controls.
xrspatial/geotiff/tests/test_ssrf_hardening_1664.py New unit tests for SSRF validation logic and _HTTPSource integration.
xrspatial/geotiff/tests/test_local_tile_byte_cap_1664.py New tests ensuring local tile/strip byte caps are enforced and configurable.
xrspatial/geotiff/tests/test_security.py Updates an existing test to reflect that the cap now applies to local reads.
xrspatial/geotiff/tests/test_features.py Ensures UnsafeURLError is tracked as part of the supported public API surface.
xrspatial/geotiff/tests/test_cog_http_concurrent.py Updates fixture to set XRSPATIAL_GEOTIFF_ALLOW_PRIVATE_HOSTS=1 for loopback HTTP tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread xrspatial/geotiff/_reader.py Outdated
Comment on lines 599 to 631
def __init__(self, url: str):
# SSRF defense (issue #1664): validate scheme / host *before*
# any network call. UnsafeURLError subclasses ValueError so
# callers that already catch ValueError keep working. The check
# is best-effort -- DNS results can change between validate
# time and connect time, but rejecting at construction blocks
# the vast majority of static SSRF payloads.
_validate_http_url(url)
self._url = url
self._size = None
self._pool = _get_http_pool()
self._connect_timeout = _http_connect_timeout()
self._read_timeout = _http_read_timeout()

def _urllib3_timeout(self):
"""Build a urllib3 Timeout object lazily.

Imported here so that the module-level import of urllib3 stays
optional (we fall back to stdlib if urllib3 is missing).
"""
import urllib3
return urllib3.Timeout(
connect=self._connect_timeout, read=self._read_timeout)

def read_range(self, start: int, length: int) -> bytes:
end = start + length - 1
if self._pool is not None:
resp = self._pool.request(
'GET', self._url,
headers={'Range': f'bytes={start}-{end}'},
timeout=self._urllib3_timeout(),
redirect=_HTTP_MAX_REDIRECTS,
)
Comment on lines 623 to 641
def read_range(self, start: int, length: int) -> bytes:
end = start + length - 1
if self._pool is not None:
resp = self._pool.request(
'GET', self._url,
headers={'Range': f'bytes={start}-{end}'},
timeout=self._urllib3_timeout(),
redirect=_HTTP_MAX_REDIRECTS,
)
return resp.data
# Fallback: stdlib
# Fallback: stdlib. urlopen's ``timeout`` is a single value, so
# use the more conservative read timeout; the connect timeout
# isn't separately controllable on stdlib urllib.
req = urllib.request.Request(
self._url,
headers={'Range': f'bytes={start}-{end}'},
)
with urllib.request.urlopen(req) as resp:
with urllib.request.urlopen(req, timeout=self._read_timeout) as resp:
return resp.read()
Comment thread xrspatial/geotiff/_reader.py Outdated
Comment on lines +331 to +343
"""Return the set of URL schemes ``_HTTPSource`` accepts.

``XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES`` (comma-separated) overrides the
default. The env var only widens the allow-list; ``http`` and ``https``
stay implicit.
"""
raw = _os_module.environ.get('XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES')
if not raw:
return _HTTP_ALLOWED_SCHEMES_DEFAULT
extras = tuple(
s.strip().lower() for s in raw.split(',') if s.strip()
)
return _HTTP_ALLOWED_SCHEMES_DEFAULT + extras
Comment thread docs/source/reference/geotiff.rst Outdated
any tile or strip whose declared size exceeds the cap.

* Default: 256 MiB
* Override: ``XRSPATIAL_COG_MAX_TILE_BYTES`` (positive integer, bytes)
Comment on lines +272 to +279
"""The top-level dispatcher refuses file:// URLs via _HTTPSource."""
# ``file://`` does not match the http(s) prefix in _open_source, so
# it does NOT hit _HTTPSource at all -- it gets routed via the path
# branch which interprets the URL literally and fails to find the
# file. The relevant guarantee is just: arbitrary local file access
# via ``file://`` URL does not succeed quietly.
from xrspatial.geotiff._reader import read_to_array
with pytest.raises((ValueError, FileNotFoundError, OSError)):
Comment thread docs/source/reference/geotiff.rst Outdated
Comment on lines +51 to +53
* Scheme allow-list: ``http`` and ``https`` only. Widen via
``XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES`` (comma-separated list, e.g.
``"ftp,gopher"``).
Merge origin/main into the SSRF-hardening branch and address the
Copilot review comments.

Redirect handling (review #1, #2):
- urllib3 path: stop delegating redirect-follow to urllib3.
  _HTTPSource now GETs with redirect=False and walks the 3xx chain
  in Python, re-validating each Location against _validate_http_url.
  Caps the chain at _HTTP_MAX_REDIRECTS (5). A public URL can no
  longer 3xx into a loopback or private IP.
- stdlib path: install _ValidatingRedirectHandler on a module-level
  opener so the stdlib fallback gets the same per-hop re-validation
  and redirect cap. read_range / read_all both go through it now.

Scheme allow-list (review #3):
- Drop XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES. _HTTPSource is a urllib3 /
  urllib Range-GET implementation; widening the validator without
  widening the source just moved the failure to connect time. fsspec
  handles every other scheme:// via _open_source's _CloudSource branch.

Env-var clamp (review #4):
- _max_tile_bytes_from_env now falls back to the default for zero or
  negative values, matching the timeout helpers. The previous
  max(1, val) clamp silently rejected every tile on a typoed
  XRSPATIAL_COG_MAX_TILE_BYTES=-1.

Docs and comments (review #5, #6):
- "Security and I/O limits" docs: explain that non-http(s) schemes
  dispatch via fsspec and that the redirect cap re-validates each
  hop. Drop the misleading XRSPATIAL_GEOTIFF_ALLOWED_SCHEMES example.
- test_read_to_array_rejects_file_url: comment corrected to match
  _open_source routing file:// through fsspec, and the expected-
  exception list now includes ImportError.

Merge conflict resolution against origin/main:
- __init__.py and test_features.py: keep both GeoTIFFFallbackWarning
  (#1662) and UnsafeURLError in __all__ / TestPublicAPI.expected.
- _reader.py _read_strips: byte-cap (this PR) and RowsPerStrip<=0
  guard (#1666) both kept; cap runs first.
- docs/geotiff.rst: "Security and I/O limits" and "Strict mode" live
  as sibling sections.

xrspatial/geotiff/tests/: 1611 passed (3 pre-existing matplotlib
palette failures unchanged).
@brendancol brendancol merged commit 919d827 into main May 12, 2026
1 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Apply tile byte cap to local files + SSRF hardening on HTTP geotiff reads

2 participants