Describe the bug
Eager reads from cloud storage via fsspec pull the entire object into memory before any TIFF header parse or max_pixels guard runs.
read_to_array() at xrspatial/geotiff/_reader.py:2925 constructs a _CloudSource for any non-HTTP :// source and immediately calls src.read_all(). _CloudSource.read_all() at xrspatial/geotiff/_reader.py:1339 does an unbounded f.read() with no size check, so a large or hostile remote TIFF on s3://, gs://, az://, or abfs:// can exhaust memory and bandwidth before the dimensions are checked.
The HTTP path already reads only what it needs via _parse_cog_http_meta and range fetches against _HTTPSource. The dask backend also bounds its metadata reads. Eager cloud is the gap.
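For orientation, a minimal sketch of the eager cloud path described above; the class body is a simplification for illustration, not the actual code at the cited line numbers.

```python
import fsspec

# Hypothetical simplification of the eager path this issue describes; the real
# classes live in xrspatial/geotiff/_reader.py.
class _CloudSource:
    def __init__(self, url):
        self.fs, self.path = fsspec.core.url_to_fs(url)
        self.size = self.fs.size(self.path)  # object size is already known here

    def read_all(self):
        with self.fs.open(self.path, "rb") as f:
            return f.read()  # unbounded: pulls the whole object into memory

def read_to_array(source):
    src = _CloudSource(source)
    data = src.read_all()  # full download happens here...
    # ...TIFF header parsing and the max_pixels guard only run afterwards
    return data
```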
Expected behavior
Eager cloud reads should either:
- Refuse objects larger than a caller-configurable byte budget, raising before any data is downloaded. _CloudSource.__init__ already knows the object size from fsspec.size(), so the check is cheap.
- Or (deeper fix) reuse the range-reader path, since _CloudSource already exposes read_range / read_ranges mirroring _HTTPSource (sketched below).
The byte-budget fix is the smaller change and closes the safety hole. The range-based refactor can land separately as backend parity work.
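As a rough sketch of that deeper fix, assuming read_range(offset, length) on _CloudSource mirrors the _HTTPSource signature; the helper name and the 64 KiB header budget are illustrative only.

```python
# Hypothetical helper for the range-based alternative; read_range(offset, length)
# is assumed to mirror _HTTPSource, and the 64 KiB budget is a placeholder.
def _read_cloud_header(src, header_budget=64 * 1024):
    # Fetch only a bounded prefix of the object so IFD/dimension parsing and the
    # max_pixels guard can run before any pixel data is requested.
    return src.read_range(0, min(header_budget, src.size))
```

Subsequent tile reads would then follow as bounded read_ranges calls, matching the HTTP backend.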
Proposed fix
- Add a MAX_CLOUD_BYTES_DEFAULT constant (256 MiB) in _reader.py, with an XRSPATIAL_GEOTIFF_MAX_CLOUD_BYTES env override.
- Plumb a max_cloud_bytes kwarg through read_to_array and open_geotiff. None opts out of the size check entirely.
- Before src.read_all() in the fsspec branch, compare src.size against the budget and refuse oversized objects with a clear error (see the sketch below).
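A minimal sketch of the guard, using the constant and env-var names above; the helper name, error type, and message wording are placeholders rather than the final implementation.

```python
import os

# Proposed default budget; the env var overrides it (names from this issue).
MAX_CLOUD_BYTES_DEFAULT = int(
    os.environ.get("XRSPATIAL_GEOTIFF_MAX_CLOUD_BYTES", 256 * 1024 * 1024)
)

def _check_cloud_size(src, max_cloud_bytes=MAX_CLOUD_BYTES_DEFAULT):
    """Refuse oversized remote objects before any bytes are downloaded.

    None opts out of the check entirely, per the proposal above.
    """
    if max_cloud_bytes is not None and src.size > max_cloud_bytes:
        raise ValueError(
            f"Remote GeoTIFF is {src.size} bytes, above the "
            f"{max_cloud_bytes}-byte eager-read budget; pass "
            "max_cloud_bytes=None to opt out or use the dask backend."
        )
```

read_to_array would call this in its fsspec branch immediately before src.read_all(), with open_geotiff passing max_cloud_bytes through unchanged.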
Additional context
Companion to PR #1873 (HTTP / loopback test gating) and to the bounded-metadata path on _HTTPSource. Surfaced by an external review of the geotiff module.