Skip to content

geotiff: require urllib3, drop stdlib HTTP fallback (#2050)#2055

Merged
brendancol merged 2 commits into
mainfrom
issue-2050
May 18, 2026
Merged

geotiff: require urllib3, drop stdlib HTTP fallback (#2050)#2055
brendancol merged 2 commits into
mainfrom
issue-2050

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2050

_HTTPSource had two HTTP transports. The urllib3 path pins the TCP connection to the IP returned by _validate_http_url, which closes the DNS-rebinding TOCTOU window from #1846. The stdlib urllib.request fallback does not pin: it re-resolves the hostname at request time, so a hostile resolver can return a public IP at validation time and a private IP at connect time. urllib3 was not in install_requires, so a real install could end up on the unpinned path and silently lose the #1846 mitigation.

install_requires change

  • setup.cfg: added urllib3 under [options] install_requires

Removed

  • Stdlib fallback in _HTTPSource.read_range (formerly _reader.py:1048)
  • Stdlib fallback in _HTTPSource.read_all (formerly _reader.py:1222)
  • _get_stdlib_opener and the _stdlib_opener module-level cache
  • _ValidatingRedirectHandler (the urllib redirect handler)
  • try / except ImportError in _get_http_pool
  • test_stdlib_* cases in test_ssrf_hardening_1664.py; the test_urllib3_* cases still cover redirect re-validation for the path that remains

Test plan

  • New file xrspatial/geotiff/tests/test_http_no_stdlib_fallback_2050.py (12 tests) locks in: urllib3 is importable, _get_http_pool returns a PoolManager, the removed symbols are gone, the reader source no longer imports urllib.request, read_range and read_all still round-trip a mocked urllib3 pool, and setup.cfg lists urllib3
  • pytest xrspatial/geotiff/tests/ -k "http or ssrf or dns_rebinding": 161 passed
  • Pre-existing GPU failures (test_predictor2_big_endian_gpu_1517.py, test_size_param_validation_gpu_vrt_1776.py) confirmed present on main, unrelated to this PR

_HTTPSource had two transport paths: a urllib3 path that pins the TCP
connection to the IP returned by _validate_http_url, and a stdlib
urllib.request fallback that re-resolves the hostname at request time.
urllib3 was not in install_requires, so a real install could land on
the fallback and silently lose the DNS-rebinding IP pin from #1846.

This change makes urllib3 a hard install dependency, removes the stdlib
fallback in read_range and read_all, and deletes _get_stdlib_opener and
_ValidatingRedirectHandler since nothing calls them anymore. The matching
test_stdlib_* cases in test_ssrf_hardening_1664.py are removed; redirect
re-validation is still covered by the test_urllib3_* cases.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 18, 2026
Address review feedback on #2055:

- _build_pinned_connection_classes: update stale docstring that called
  urllib3 an optional import. The lazy import is still fine, but the
  reason is module-load cost, not optionality.
- test_dns_rebinding_pin_issue_1846: drop five pytest.importorskip
  ("urllib3") calls. urllib3 is required now; the skip can never trigger.
@brendancol brendancol merged commit e96d2a7 into main May 18, 2026
5 checks passed
brendancol added a commit that referenced this pull request May 18, 2026
#2050 / #2055 dropped the stdlib ``urllib.request`` HTTP fallback
and made urllib3 a hard dependency. The three
``test_read_all_stdlib_fallback_*`` cells that exercised the
fallback's byte-budget enforcement no longer have a code path to
hit; the urllib3-only equivalents above keep the contract covered.
brendancol added a commit that referenced this pull request May 18, 2026
* geotiff: bound HTTP full-image strip reads to declared byte size (#2051)

`_HTTPSource.read_all()` was unbounded: no Content-Length validation,
no streaming cap. The full-image stripped HTTP path in
`_fetch_decode_cog_http_strips` only ran `_check_dimensions` against
the TIFF header's declared pixel count, then handed the full body to
`_read_strips`. A tiny declared raster (100x100, well within
`_check_dimensions`) could still be served as a multi-gigabyte body
and the whole body was allocated before TIFF parsing rejected anything.

Fix (option b from the security report):

- `_HTTPSource.read_all()` gains a `max_bytes: int | None = None`
  parameter. When set, it rejects an oversized `Content-Length` up
  front via `OSError`, then streams the body (urllib3
  `preload_content=False` / stdlib `resp.read` in 64 KiB chunks) and
  aborts past `max_bytes + 1`. The `+ 1` is the over-shoot probe: a
  body that exactly matches the cap passes, but a server that omits
  or lies about Content-Length is caught the moment the first extra
  byte arrives.
- `_request()` gains `preload_content=False` plumbing and releases
  the connection on the redirect path so streaming responses do not
  leak pool slots.
- `_fetch_decode_cog_http_strips` computes the budget from the strip
  table (`max(offset + byte_count) + 4 MiB slack` for TIFF header /
  trailing metadata) and passes it through.
- New `_compute_full_image_byte_budget` helper handles the
  degenerate cases (missing or all-sparse strip table) by falling
  back to the per-strip safety cap.

`max_bytes=None` preserves the legacy unbounded behaviour for the
non-HTTP callers (cloud reads are already gated by
`max_cloud_bytes`; file reads are bounded by file size).

Tests cover:
- Content-Length advertised over budget -> OSError before the read.
- Server omits Content-Length entirely (chunked) -> streaming cap
  catches the over-shoot.
- Server lies low (advertises small, sends big) -> urllib3 truncates
  at the advertised length, locked in as the expected behaviour.
- Legitimate full-image COG reads still work end-to-end.
- Attacker pads a legitimate COG body with 64 MiB of zeros -> read
  rejected by either the Content-Length pre-flight or the streaming
  cap.
- Budget helper returns the strip-table maximum plus slack, and
  falls back to the per-strip cap for empty / degenerate tables.

Closes #2051.

* geotiff: address review feedback on read_all byte-budget (#2051)

- Stdlib fallback: add three tests that force ``self._pool = None`` so
  the urllib.request branch in ``read_all`` is actually exercised. One
  for the Content-Length pre-flight check, one for the streaming cap
  on chunked bodies, and one for the legitimate-body pass-through.
- Redirect on the streaming path: drain the redirect response body
  before ``release_conn`` so urllib3 can return the connection to the
  pool instead of closing it. 3xx bodies are tiny so the drain is
  cheap.
- Doc tweak in ``_check_content_length``: spell out that returning
  silently on missing / unparseable headers is intentional because
  the streaming cap is the real defence.
- Test docstring fix: the padded-body test rejects before the body
  is fully buffered, not before it is allocated at all.

* geotiff: drop stdlib-fallback tests now that urllib3 is required (#2051)

#2050 / #2055 dropped the stdlib ``urllib.request`` HTTP fallback
and made urllib3 a hard dependency. The three
``test_read_all_stdlib_fallback_*`` cells that exercised the
fallback's byte-budget enforcement no longer have a code path to
hit; the urllib3-only equivalents above keep the contract covered.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] HTTP fallback path bypasses DNS-rebinding IP pinning when urllib3 unavailable

1 participant