geotiff: sidecar download honours max_cloud_bytes#2123

Merged
brendancol merged 3 commits into main from deep-sweep-security-geotiff-2026-05-19-01
May 19, 2026

Conversation

@brendancol
Contributor

Closes #2121.

Summary

load_sidecar downloaded the sibling .tif.ovr over HTTP and fsspec with
no byte cap. The base-file max_cloud_bytes budget that read_to_array
and _CloudSource enforce was bypassed entirely, so a hostile server
could serve a tiny base TIFF (which passes the cloud-budget check) plus a
multi-GB <base>.tif.ovr sidecar; opening the file with overview_level >= 1
pulled the full sidecar body into memory and OOMed the process.

Changes

  • xrspatial/geotiff/_sidecar.py: load_sidecar accepts max_cloud_bytes
    and applies it on both remote transports.
    • HTTP path forwards the cap to _HTTPSource.read_all(max_bytes=...),
      reusing the existing streaming overshoot detector (which raises
      OSError when the body exceeds the cap).
    • fsspec path stats the sidecar via fs.size() and raises
      CloudSizeLimitError when the declared size exceeds the budget,
      mirroring the _CloudSource guard at _reader.py:3239-3260.
  • xrspatial/geotiff/_reader.py: read_to_array resolves the cloud
    budget once at the top so both the base-file size guard and the
    sidecar fetch see the same cap. The sidecar load now passes the
    resolved budget through.
  • max_cloud_bytes=None preserves the legacy unbounded behaviour,
    matching the base-file semantics.
  • The GPU eager and dask-metadata call sites only see local file paths
    (via _FileSource and the _read_geo_info local branch), so they fall
    through to the mmap path and inherit the default max_cloud_bytes=None.

Backend coverage

  • numpy: covered (sidecar via read_to_array).
  • cupy: covered (GPU sidecar is local-mmap only; no remote download to
    cap).
  • dask + numpy: covered (dask metadata helper hits the local-mmap path).
  • dask + cupy: covered (chunked GPU path does not currently read
    sidecars, per the comment in _backends/gpu.py).

Test plan

  • pytest xrspatial/geotiff/tests/test_sidecar_max_cloud_bytes_2121.py
    (8 tests covering fsspec size guard, HTTP streaming cap,
    max_cloud_bytes=None unbounded path, local mmap passthrough, and
    end-to-end propagation from read_to_array to load_sidecar).
  • pytest xrspatial/geotiff/tests/test_sidecar_ovr_2112.py (28
    existing sidecar tests pass unchanged).
  • pytest -k "cloud or sidecar" xrspatial/geotiff/tests/ (83 tests
    pass, no regressions).

`load_sidecar` downloaded the sibling `.tif.ovr` over HTTP and fsspec
with no byte cap. The base-file `max_cloud_bytes` budget that
`read_to_array` and `_CloudSource` enforce was bypassed entirely, so a
hostile server could serve a tiny base TIFF that passes the cloud
budget plus a multi-GB sidecar; opening with `overview_level >= 1`
pulled the full sidecar body into memory and OOMed the process.

Thread `max_cloud_bytes` through `load_sidecar` and apply it on both
transports:

- HTTP path forwards the cap to `_HTTPSource.read_all(max_bytes=...)`,
  reusing the existing streaming overshoot detector.
- fsspec path stats the sidecar via `fs.size()` and raises
  `CloudSizeLimitError` when the declared size exceeds the budget,
  mirroring the `_CloudSource` guard at `_reader.py:3239-3260`.

`read_to_array` now resolves the cloud budget once at the top of the
function so both the base-file size guard and the sidecar fetch see the
same effective cap. `max_cloud_bytes=None` preserves the legacy
unbounded behaviour. The local-file mmap path is unchanged.
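
A rough sketch of the "resolve once, pass to both" idea, under stated assumptions: `resolve_budget`, `read_with_sidecar`, and the `_UNSET` sentinel are hypothetical names, and the precedence shown (explicit kwarg, then the `XRSPATIAL_GEOTIFF_MAX_CLOUD_BYTES` environment variable, then unbounded) is a guess at the behaviour of `_resolve_max_cloud_bytes`, not a copy of it.

```python
import os

_UNSET = object()  # hypothetical sentinel: "caller did not pass the kwarg"
ENV_VAR = "XRSPATIAL_GEOTIFF_MAX_CLOUD_BYTES"


def resolve_budget(max_cloud_bytes=_UNSET, environ=os.environ):
    """Return the effective cap in bytes, or None for "unbounded".

    Assumed precedence: explicit kwarg, then the environment variable,
    then unbounded. The library's real default may differ.
    """
    if max_cloud_bytes is not _UNSET:
        return max_cloud_bytes
    raw = environ.get(ENV_VAR)
    return int(raw) if raw else None


def read_with_sidecar(read_base, read_sidecar, max_cloud_bytes=_UNSET,
                      environ=os.environ):
    """Resolve the cap once, then hand the same value to both fetches."""
    budget = resolve_budget(max_cloud_bytes, environ)
    return read_base(budget), read_sidecar(budget)
```

Because the budget is computed in one place, the base-file guard and the sidecar fetch cannot end up with different caps.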

The GPU and dask-metadata sidecar call sites only see local file paths
(via `_FileSource` and the `_read_geo_info` local branch respectively),
so they fall through to the mmap branch and inherit the default
`max_cloud_bytes=None` for free.

Tests cover the fsspec size guard, the HTTP streaming cap, the
`max_cloud_bytes=None` unbounded path, the local mmap passthrough, and
end-to-end propagation from `read_to_array` into `load_sidecar` with a
sidecar inflated past the base file's size.

Closes #2121.
@github-actions (bot) added the performance label (PR touches performance-sensitive code) on May 19, 2026
Contributor Author

@brendancol left a comment

Domain-aware review (post-merge audit)

Automated self-review of the sidecar max_cloud_bytes fix.

Blockers

None.

Suggestions

  1. Asymmetric exception type across transports. The fsspec path raises
    CloudSizeLimitError from a pre-flight fs.size() check; the HTTP path
    raises OSError from the streaming overshoot detector inside
    _HTTPSource.read_all. Both are valid for the threat model (fsspec
    knows the size up front, HTTP may not), and the docstring now spells
    the asymmetry out. Callers who catch CloudSizeLimitError will miss
    the HTTP path. Consider whether a thin wrapper around the HTTP read
    should translate OSError to CloudSizeLimitError when the cause is
    the streaming cap. Filing as a follow-up would be fine if the
    asymmetry is intentional for now.

  2. fsspec.core.url_to_fs vs fsspec.open. The fsspec branch now
    makes two calls against the filesystem (once for size, once for
    read). The change from fsspec.open(path) to url_to_fs +
    fs.open(fs_path) is needed so both calls share one filesystem
    handle, but it means the file is no longer opened via fsspec.open.
    The test suite passes for file:// URIs; the production fsspec
    schemes (s3://, gs://, az://, ...) are exercised elsewhere but
    not in this PR's tests. Worth a smoke test against at least one
    non-file:// backend (e.g. memory://) if available.

Nits

  • The end-to-end test inflates the sidecar by appending zero bytes to
    force sidecar_size > base_size. The TIFF parser tolerates trailing
    garbage past the IFD chain, so this works, but a monkey-patched
    fs.size() would be a more direct way to express the contract. The
    current approach has the virtue of exercising the real disk path.

  • The docstring's "Issue #2121" anchor appears in three places
    (Parameters, HTTP comment, fsspec comment). Slight redundancy, but
    consistent with the rest of the module's commenting style.
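
The two ways of forcing "sidecar larger than budget" that the first nit compares can be sketched as below. Both helpers are hypothetical illustrations, not code from the test suite: `inflate_with_zeros` grows a real file on disk, while `ReportedSize` wraps any filesystem-like object and overrides only what `size()` reports.

```python
import os


def inflate_with_zeros(path, extra_bytes):
    """Append zero bytes so the on-disk size grows; returns the new size."""
    with open(path, "ab") as fh:
        fh.write(b"\0" * extra_bytes)
    return os.path.getsize(path)


class ReportedSize:
    """Wrap a filesystem-like object and override only what size() reports."""

    def __init__(self, fs, fake_size):
        self._fs = fs
        self._fake_size = fake_size

    def size(self, path):
        return self._fake_size

    def __getattr__(self, name):
        # Everything other than size() is delegated to the wrapped object.
        return getattr(self._fs, name)
```

The first approach exercises the real stat path; the second states the size contract directly without touching the file contents.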

Coverage

  • 8 new tests in test_sidecar_max_cloud_bytes_2121.py covering the
    fsspec size guard, HTTP streaming cap, max_cloud_bytes=None
    unbounded path, local-mmap passthrough, and end-to-end propagation
    from read_to_array into load_sidecar.
  • 28 existing tests in test_sidecar_ovr_2112.py pass unchanged.
  • 83 tests pass under pytest -k "cloud or sidecar" with no
    regressions.

Dispositions

The asymmetric-exception suggestion is being deferred to a follow-up
because aligning the two transports cleanly requires a small refactor
of _HTTPSource.read_all (currently raises raw OSError from inside
the urllib3 stream loop) that is outside this PR's scope. The
memory:// fsspec coverage suggestion is also being deferred; the existing
test_sidecar_ovr_2112.py::test_find_sidecar_fsspec_probe_returns_uri_when_present
already exercises file:// end-to-end; broader fsspec backend coverage
belongs with the golden-corpus matrix (#1930).

Address review feedback on #2121:

1. ``load_sidecar`` now translates the ``OSError`` raised by
   ``_HTTPSource.read_all`` budget guards into ``CloudSizeLimitError``
   so the HTTP and fsspec branches surface the same exception type for
   "sidecar exceeds the cap". The original ``OSError`` is preserved as
   ``__cause__``. Non-budget HTTP failures (connection reset, DNS,
   etc.) still propagate as ``OSError``.

2. Add ``test_env_var_propagates_to_sidecar`` to pin that
   ``XRSPATIAL_GEOTIFF_MAX_CLOUD_BYTES`` (no kwarg) caps the sidecar
   the same way an explicit ``max_cloud_bytes`` kwarg does.

The existing HTTP overshoot test now asserts ``CloudSizeLimitError``
and that ``__cause__`` retains the underlying ``OSError`` detail.
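
The translation step can be sketched as follows. This is an illustration under assumptions, not the code in `load_sidecar`: `read_sidecar_http` is a hypothetical wrapper, `CloudSizeLimitError` is a local stand-in, and matching on a "byte budget" substring follows the marker mentioned in this PR's review rather than a verified message format.

```python
class CloudSizeLimitError(ValueError):
    """Stand-in for the library's exception; the real base class may differ."""


BUDGET_MARKER = "byte budget"  # marker text mentioned in this PR's review


def read_sidecar_http(read_all, max_bytes):
    """Call `read_all(max_bytes=...)` and normalise budget failures."""
    try:
        return read_all(max_bytes=max_bytes)
    except OSError as exc:
        if BUDGET_MARKER in str(exc):
            # `from exc` stores the original error on __cause__.
            raise CloudSizeLimitError(str(exc)) from exc
        raise  # connection reset, DNS failure, etc. stay OSError
```

Callers can then catch a single exception type for "sidecar exceeds the cap" on either transport, while still inspecting `__cause__` for the HTTP detail.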
@brendancol
Contributor Author

PR Review: geotiff: sidecar download honours max_cloud_bytes

Blockers

None.

Suggestions (addressed in follow-up commit 77e5bb6)

  • HTTP and fsspec branches surfaced different exception types for the same failure mode. The HTTP path raised OSError from the streaming overshoot detector; the fsspec path raised CloudSizeLimitError. load_sidecar now catches the budget OSError from _HTTPSource.read_all (identified by the stable "byte budget" marker) and re-raises as CloudSizeLimitError, preserving the original error via __cause__. Non-budget HTTP failures (connection reset, DNS, etc.) still propagate as OSError.
  • The kwarg path was tested but the env-var path was not. Added test_env_var_propagates_to_sidecar to pin that XRSPATIAL_GEOTIFF_MAX_CLOUD_BYTES set on the source flows through _resolve_max_cloud_bytes into load_sidecar exactly the way an explicit kwarg does.

Nits (left for a future cleanup, not blocking)

  • CloudSizeLimitError formatting is duplicated between _sidecar.py and _reader.py. A _format_cloud_size_error helper would centralize the message text for the next tweak.
  • _start_http_server in test_sidecar_max_cloud_bytes_2121.py:49 calls httpd.shutdown() but not httpd.server_close(). The listening socket lingers until process exit; the existing test_sidecar_ovr_2112.py follows the same pattern, so this is at least consistent.
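
For the second nit, a test helper that releases the listening socket could look roughly like this. It is a sketch using only the standard library; `start_http_server` and `stop_http_server` are hypothetical names and do not reproduce the helper in the test file.

```python
import http.server
import threading


def start_http_server(handler=http.server.SimpleHTTPRequestHandler):
    """Start a throwaway HTTP server on an OS-assigned localhost port."""
    httpd = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    return httpd, thread


def stop_http_server(httpd, thread):
    """Stop the serve loop, then release the listening socket."""
    httpd.shutdown()      # ends serve_forever()
    httpd.server_close()  # closes the listening socket
    thread.join(timeout=5)
```

Calling `server_close()` after `shutdown()` frees the port immediately instead of leaving it bound until the test process exits.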

What looks good

  • Single source of truth: cloud_budget = _resolve_max_cloud_bytes(...) resolves once at the top of read_to_array and flows into both the base-file _CloudSource guard and the sidecar fetch. No risk of the two paths diverging.
  • New parameter is keyword-only with None default. PR's claim that the other call sites (__init__.py:238, _backends/gpu.py:311) only ever feed local paths into load_sidecar checks out: the metadata helper returns early on fsspec URIs before reaching sidecar discovery, and the GPU path uses _FileSource which only accepts local files. Both fall through to the mmap branch where max_cloud_bytes is correctly ignored.
  • HTTP path reuses the existing streaming overshoot detector at _reader.py:1284-1307. The cap is enforced both via the Content-Length pre-check and a per-chunk running total, so a server that omits or lies about Content-Length still trips the limit.
  • fsspec branch mirrors the _CloudSource size guard at _reader.py:3239-3265 rather than inventing a new pattern.
  • Tests now cover: local mmap passthrough, fsspec size guard, HTTP streaming cap, max_cloud_bytes=None unbounded path on both transports, end-to-end propagation from read_to_array through to load_sidecar, and the env-var path. 9 tests in the new file plus 28 existing sidecar tests pass unchanged, 84 combined cloud+sidecar tests green.
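
The two-layer HTTP cap noted above (a Content-Length pre-check plus a per-chunk running total) can be sketched as below. This is a minimal illustration over a generic file-like stream, not the body of `_HTTPSource.read_all`; the function name, parameters, and error messages are assumptions.

```python
import io


def read_all_capped(stream, max_bytes=None, content_length=None,
                    chunk_size=64 * 1024):
    """Read a file-like `stream` fully, never accumulating past the cap."""
    if max_bytes is not None and content_length is not None:
        if content_length > max_bytes:
            # Cheap pre-check: the header is trusted only to reject early.
            raise OSError(
                f"Content-Length {content_length} exceeds byte budget "
                f"{max_bytes}"
            )
    chunks, total = [], 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if max_bytes is not None and total > max_bytes:
            # Running total catches a missing or understated header.
            raise OSError(f"body exceeded byte budget {max_bytes}")
        chunks.append(chunk)
    return b"".join(chunks)
```

The pre-check saves bandwidth when the server is honest; the running total is what actually bounds memory when it is not.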

Checklist

  • Algorithm matches behaviour described in #2121 (geotiff: .tif.ovr sidecar download bypasses max_cloud_bytes)
  • All sidecar paths from every backend traced and confirmed
  • NaN handling: N/A (byte-cap fix, not a numerical change)
  • Edge cases covered: unknown size, sub-budget, equal budget, None budget, local mmap, env-var fallback
  • No premature materialization (streaming HTTP path; fsspec stat is O(1))
  • Benchmark: not needed (security guard)
  • README feature matrix: N/A (no new public function)
  • Docstrings present and accurate

@brendancol brendancol merged commit 91532a3 into main May 19, 2026
1 of 5 checks passed

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

geotiff: .tif.ovr sidecar download bypasses max_cloud_bytes

1 participant